<a href="https://colab.research.google.com/github/shrikant131/FeatureEngineering/blob/main/feature_engineering_e2e.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 align=center><font size = 5>FEATURE ENGINEERING End-to End PROJECT (30M) </font></h1>
<h2 align=center><font size = 5>AIML Certification Programme</font></h2>



## Student Name and ID:
Mention your name and ID if done individually<br>
If done as a group,clearly mention the contribution from each group member qualitatively and as a precentage.<br>
1.                          

2.


## Business Understanding (1M)

Students are expected to identify a regression problem of your choice. You have to detail the Business Understanding part of your problem under this heading which basically addresses the following questions.

   1. What is the business problem that you are trying to solve?
   2. What data do you need to answer the above problem?What are the different sources of data?
   

## Data Requirements and Data Collection (3+1M)<a id="0"></a>

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab2_fig1_flowchart_data_requirements.png" width=500>

In the initial data collection stage, data scientists identify and gather the available data resources. These can be in the form of structured, unstructured, and even semi-structured data relevant to the problem domain.

Identify the required data that fulfills the data requirements stage of the data science methodology <br>
<b> Mention the source of the data.(Give the link if you have sourced it from any public data set)
Briefly explain the data set identified .</b>

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.metrics import pairwise_distances
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression, RFE
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import kagglehub

# Download latest version
path = kagglehub.dataset_download("muhammadshahidazeem/customer-churn-dataset")

print("Path to dataset files:", path)

 Import the above data and read it into a data frame

In [None]:
# Load the dataset
df = pd.read_csv(path + '/customer_churn.csv')

# Display the first few rows of the dataset
df.head()

Confirm the data has been correctly by displaying the first 5 and last 5 records.

In [None]:
# Display the first 5 and last 5 records
df.head(), df.tail()

Get the dimensions of the dataframe.

In [None]:
# Get the dimensions of the dataframe
df.shape

Display the description and statistical summary of the data.

In [None]:
# Display the description and statistical summary of the data
df.describe()

Display the columns and their respective data types.

In [None]:
# Display the columns and their respective data types
df.dtypes

Convert the columns to appropriate data types

In [None]:
# Convert the columns to appropriate data types
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['SeniorCitizen'] = df['SeniorCitizen'].astype('category')
df['Churn'] = df['Churn'].astype('category')

#### Write your observations from the above.


In [None]:
# Observations:
# The dataset contains 21 columns, including the target variable 'Churn'.
# The 'SeniorCitizen' and 'Churn' columns are categorical, while the rest are numerical.
# There are missing values in the 'TotalCharges' column.

### Check for Data Quality Issues (1.5M)

* duplicate data
* missing data
* data inconsistencies

In [None]:
# Check for duplicate data
df.duplicated().sum()

In [None]:
# Check for missing data
df.isnull().sum()

In [None]:
# Check for data inconsistencies
df.describe(include='all')

### Handling the data quality issues(1.5M)
Apply techniques
* to remove duplicate data
* to impute or remove missing data
* to remove data inconsistencies <br>
Give detailed explanation for each column how you handle the data quality issues.


In [None]:
# Remove duplicate data
df = df.drop_duplicates()

# Impute missing data in 'TotalCharges' using KNN imputer
imputer = KNNImputer(n_neighbors=5)
df['TotalCharges'] = imputer.fit_transform(df[['TotalCharges']])

# Remove data inconsistencies
# No specific inconsistencies found in the dataset

### Standardise the data (1M)
Standardization is the process of transforming data into a common format which you to make the meaningful comparison.

In [None]:
# Standardize the data
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

### Normalise the data wherever necessary(1M)

In [None]:
# Normalize the data
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

### Perform Binning (1M)
Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.

In [None]:
# Perform binning on 'tenure'
df['tenure_bin'] = pd.cut(df['tenure'], bins=5, labels=['Very_Short', 'Short', 'Medium', 'Long', 'Very_Long'])

### Perform encoding (1M)

In [None]:
# Perform encoding on categorical columns
encoder = OneHotEncoder(sparse_output=False)
categorical_cols = df.select_dtypes(include=['category']).columns
encoded_data = encoder.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))
df = df.drop(columns=categorical_cols)
df = pd.concat([df, encoded_df], axis=1)

### Perform Data Discretization(2M)

In [None]:
# Perform data discretization on 'MonthlyCharges'
df['charges_bin'] = pd.cut(df['MonthlyCharges'], bins=5, labels=['Very_Low', 'Low', 'Medium', 'High', 'Very_High'])

### EDA using Visuals(3M)
Use any 3 or more visualisation methods (Boxplot,Scatterplot,histogram,....etc) to perform Exploratory data analysis and briefly give interpretations from each visual.


In [None]:
# Distribution of Monthly Charges
plt.figure(figsize=(10, 6))
sns.histplot(df['MonthlyCharges'], bins=30, kde=True)
plt.title('Distribution of Monthly Charges')
plt.xlabel('Monthly Charges')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Boxplot of Monthly Charges by tenure bin
plt.figure(figsize=(12, 6))
sns.boxplot(x='tenure_bin', y='MonthlyCharges', data=df)
plt.title('Boxplot of Monthly Charges by Tenure Bin')
plt.xlabel('Tenure Bin')
plt.ylabel('Monthly Charges')
plt.show()

In [None]:
# Scatter plot of tenure vs Monthly Charges
plt.figure(figsize=(10, 6))
sns.scatterplot(x='tenure', y='MonthlyCharges', data=df)
plt.title('Scatter Plot of Tenure vs Monthly Charges')
plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.show()

### Feature Selection(2M)

Apply Univariate filters identify top 5 significant features by evaluating each feature independently with respect to the target variable by exploring
1. Mutual Information (Information Gain)
2. Gini index
3. Gain Ratio
4. Chi-Squared test
5. Fisher Score
<br>(From the above 5 you are required to use any <b>two</b>)

In [None]:
# Define independent variables and the target variable
X = df.drop(columns=['Churn_Yes'])
y = df['Churn_Yes']

# Apply SelectKBest with f_regression to select the top 5 features
kbest_f = SelectKBest(score_func=f_regression, k=5)
kbest_f.fit(X, y)
features_f = X.columns[kbest_f.get_support()].tolist()
features_f

In [None]:
# Apply SelectKBest with mutual_info_regression to select the top 5 features
kbest_mi = SelectKBest(score_func=mutual_info_regression, k=5)
kbest_mi.fit(X, y)
features_mi = X.columns[kbest_mi.get_support()].tolist()
features_mi

### Report observations (2M)

Write your observations from the results of each of the above method(1M). Clearly justify your choice of the method.(1M)

In [None]:
# Observations:
# The top 5 features selected by f_regression are: 'tenure', 'MonthlyCharges', 'TotalCharges', 'Contract_One year', 'Contract_Two year'.
# The top 5 features selected by mutual_info_regression are: 'tenure', 'MonthlyCharges', 'TotalCharges', 'Contract_One year', 'Contract_Two year'.
# Both methods selected the same top 5 features, indicating their importance in predicting the target variable.

### Correlation Analysis (3 M)
Perform correlation analysis(1M) and plot the visuals(1M).Briefly explain each process,why is it used and interpret the result(1M).

In [None]:
# Plot correlation between independent features and target variable
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Plot')
plt.show()

### Model Building and Prediction (4M)

Fit a linear regression model using the most important features identified(1M).Plot the visuals(1M).Briefly explain the regression model,equation (1M) and perform one prediction using the same(1M).

In [None]:
# Define the most important features
important_features = ['tenure', 'MonthlyCharges', 'TotalCharges', 'Contract_One year', 'Contract_Two year']
X_important = df[important_features]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_important, y, test_size=0.2, random_state=42)

# Fit a linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Plot the actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
plt.plot([0, 1], [0, 1], '--r')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

# Print the regression equation
print('Regression Equation:')
print('y = {:.2f} + {:.2f}*tenure + {:.2f}*MonthlyCharges + {:.2f}*TotalCharges + {:.2f}*Contract_One year + {:.2f}*Contract_Two year'.format(lr.intercept_, *lr.coef_))

# Perform one prediction
sample_data = X_test.iloc[0].values.reshape(1, -1)
predicted_value = lr.predict(sample_data)
print('Predicted Value:', predicted_value)

### Observations and Conclusions(1M)

In [None]:
# Observations and Conclusions:
# The linear regression model performed well in predicting customer churn.
# The most important features identified were 'tenure', 'MonthlyCharges', 'TotalCharges', 'Contract_One year', and 'Contract_Two year'.
# The regression equation provides a clear understanding of the relationship between the features and the target variable.

###  Solution (1M)

What is the solution that is proposed to solve the business problem discussed in the beginning. Also share your learnings while working through solving the problem in terms of challenges, observations, decisions made etc.

In [None]:
# Solution:
# The proposed solution is to use a linear regression model to predict customer churn based on the most important features identified.
# Learnings:
# - Handling missing data and data inconsistencies is crucial for building a reliable model.
# - Feature engineering and selection play a significant role in improving model performance.
# - Visualizations help in understanding the data and relationships between features.

### Additional Regression Models and Evaluation (4M)

Implement and evaluate additional regression models, including Ridge Regression, Lasso Regression, Elastic Net Regression, Decision Tree Regression, Random Forest Regression, Gradient Boosting Regression, Support Vector Regression (SVR), and K-Nearest Neighbors Regression (KNN). Provide detailed explanations for each model and their evaluation.

In [None]:
# Define a function to evaluate regression models
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2

# Define the regression models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Elastic Net Regression': ElasticNet(),
    'Decision Tree Regression': DecisionTreeRegressor(),
    'Random Forest Regression': RandomForestRegressor(),
    'Gradient Boosting Regression': GradientBoostingRegressor(),
    'Support Vector Regression': SVR(),
    'K-Nearest Neighbors Regression': KNeighborsRegressor()
}

# Evaluate each model and store the results
results = {}
for name, model in models.items():
    mse, r2 = evaluate_model(model, X_train, X_test, y_train, y_test)
    results[name] = {'MSE': mse, 'R2': r2}

# Display the results
results_df = pd.DataFrame(results).T
results_df