### 1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

   Design a pipeline that includes the following steps:

   1. Use an automated feature selection method to identify the important features in the dataset.
 
   2. Create a numerical pipeline that imputes missing values in the numerical columns with the column mean, then scales those columns using standardisation.
 
   3. Create a categorical pipeline that imputes missing values in the categorical columns with the most frequent value, then one-hot encodes those columns.
 
   4. Combine the numerical and categorical pipelines using a ColumnTransformer.
 
   5. Use a Random Forest Classifier to build the final model.
 
   6. Evaluate the accuracy of the model on the test dataset.

Import Libraries: Import necessary Python libraries such as numpy, pandas, and various components from scikit-learn.

In [2]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

Load Data: Read the customer churn CSV file into a pandas DataFrame.

In [6]:
data = pd.read_csv("C://Users//Susheel Yadav//Desktop//CustomerChurn.csv")

Preview Data: Display the first five rows of the DataFrame.

In [17]:
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


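Split Data: Split the dataset into features (X) and the target variable (y), where X contains all the customer information except for the "Churn" column, and y contains the "Churn" column indicating whether a customer churned or not. Note that X still contains `customerID`, a unique identifier stored as text; it is treated as a categorical feature downstream even though it carries no predictive signal, so dropping it is worth considering.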

In [7]:
X = data.drop('Churn', axis=1)
y = data['Churn']

Split Training and Test Sets: Split the data into training and test sets using train_test_split from scikit-learn. This is essential for model evaluation.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
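
The split above is purely random. If the churn classes turn out to be imbalanced, passing `stratify` to `train_test_split` is one option for keeping the class ratio the same in both partitions. The following is a minimal sketch on invented toy data (`X_toy`, `y_toy` and the 80/20 ratio are assumptions for illustration, not taken from CustomerChurn.csv):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 negatives, 20 positives (illustrative data only).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(np.bincount(y_tr))  # class counts in the training partition
print(np.bincount(y_te))  # class counts in the test partition
```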

Define Numerical and Categorical Features: Identify the numerical and categorical features in the dataset based on their data types.

In [9]:
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
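
To see what dtype-based selection picks up, here is a small sketch on a hypothetical miniature of the churn table. The frame `toy` and its values are invented for illustration, and the dtypes are set explicitly so the result does not depend on the pandas version. It suggests that an identifier column stored as text ends up among the categorical features:

```python
import pandas as pd

# Hypothetical miniature of the churn table (values invented for illustration).
toy = pd.DataFrame({
    "customerID": ["a-1", "b-2", "c-3"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
}).astype({
    # Explicit dtypes so the selection below behaves the same everywhere.
    "customerID": object,
    "tenure": "int64",
    "MonthlyCharges": "float64",
    "Contract": object,
})

# Same selection logic as in the cell above, applied to the toy frame.
num_cols = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols = toy.select_dtypes(include=["object"]).columns.tolist()

print(num_cols)
print(cat_cols)
```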

Numerical and Categorical Pipelines: Create separate preprocessing pipelines for numerical and categorical features.

For numerical features: Impute missing values with the mean and scale the values using standardization (z-score scaling).

For categorical features: Impute missing values with the most frequent value and one-hot encode the categories.

In [11]:
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
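
As a quick check of what the numerical pipeline does, the sketch below runs the same two steps on a single invented column with one missing value. The names `demo_pipeline` and `col` are illustrative only. The missing entry is replaced by the column mean (2.0), and the result is then standardised to zero mean and unit variance:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same steps as numerical_pipeline, rebuilt here so the example is self-contained.
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# One column with a missing value; the mean of the observed values is 2.0.
col = np.array([[1.0], [np.nan], [3.0]])

out = demo_pipeline.fit_transform(col)
print(out.ravel())  # approximately [-1.2247, 0.0, 1.2247]
```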

ColumnTransformer: Combine the two preprocessing pipelines into one using ColumnTransformer, which applies each pipeline only to its own columns and concatenates the results.

In [12]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])
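
The effect of the combined preprocessor can be illustrated on a tiny invented frame. This is a sketch under assumed data: the frame `toy` and the name `demo_preprocessor` are hypothetical. One numerical column plus a categorical column with two distinct values should produce three output columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented data with one missing value in each column.
toy = pd.DataFrame({
    "tenure": [1.0, np.nan, 2.0, 45.0],
    "Contract": pd.Series(
        ["Month-to-month", "Month-to-month", "One year", np.nan], dtype=object
    ),
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

demo_preprocessor = ColumnTransformer(transformers=[
    ("num", num_pipe, ["tenure"]),
    ("cat", cat_pipe, ["Contract"]),
])

Xt = demo_preprocessor.fit_transform(toy)
# 1 scaled numerical column + 2 one-hot columns (one per contract type).
print(Xt.shape)
```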

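Feature Selection: Use SelectFromModel with a Random Forest Classifier to select the most important features. The selector fits its own forest and, by default, keeps the features whose importance is at least the mean importance. Because it runs after the preprocessor, it selects among the scaled numerical columns and the one-hot encoded columns rather than the raw features. This reduces the dimensionality of the data passed to the final model.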

Final Classifier Pipeline: Create the final pipeline that consists of preprocessing (including feature selection) and a Random Forest Classifier as the predictive model.

In [13]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=0))
])
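
To inspect what the selection step keeps, one option is to fit a `SelectFromModel` on its own and look at its support mask. The sketch below uses synthetic data from `make_classification` rather than the churn data; `X_syn`, `y_syn` and the feature counts are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for preprocessed features: 10 columns, 3 of them informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X_syn, y_syn)

# Boolean mask over the input columns: True where the feature is kept.
mask = selector.get_support()
X_reduced = selector.transform(X_syn)

print(mask.sum(), "of", mask.size, "features kept")
print(X_reduced.shape)
```

Inside the full pipeline, the fitted selector is reachable as `pipeline.named_steps['feature_selector']` after `pipeline.fit` has run.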

Fit the Pipeline: Fit the entire pipeline, including preprocessing and model building, to the training data.

In [14]:
pipeline.fit(X_train, y_train)

Make Predictions: Use the trained pipeline to make predictions on the test dataset.

In [15]:
y_pred = pipeline.predict(X_test)

Evaluate Model: Calculate the accuracy of the model by comparing the predicted values to the actual values in the test set using the accuracy_score function from scikit-learn.

In [16]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.7998580553584103


**Interpretation of Results:**

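An accuracy of around 80% is a reasonable starting point: the model predicts correctly for roughly four out of five customers in the test set. Accuracy alone can mislead when one class dominates, so it is worth comparing this figure against the accuracy of always predicting the majority class.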

However, it's important to consider the context and business requirements. In some industries, an 80% accuracy may be acceptable, while in others, it may need further improvement.
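
One way to put an accuracy figure in context is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below does this with `DummyClassifier` on synthetic data. The 75/25 class split is an assumption for illustration, not a measurement from CustomerChurn.csv:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumed 75/25 split, not the churn dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

print(f"Baseline accuracy: {baseline_acc:.3f}")
```

A model is only adding value to the extent that it beats this number.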

**Possible Improvements:**

1. **Feature Engineering:** Try adding new features or transforming existing ones to capture hidden patterns in the data.

2. **Hyperparameter Tuning:** We can use techniques like grid search or random search to find the best combination of hyperparameters for our specific problem.

3. **Imbalanced Classes:** Consider techniques like oversampling the minority class or using different evaluation metrics (e.g., F1-score) that account for class imbalance.

4. **Other Ensemble Models:** A Random Forest is already a bagging ensemble; experiment with boosting ensembles such as Gradient Boosting or XGBoost, which often perform well on tabular classification tasks.

5. **Cross-Validation:** Cross-validation helps us get a better estimate of our model's performance by evaluating it on multiple folds of the data.

6. **Feature Importance Analysis:** Analyze the feature importances determined by our Random Forest model. We may discover that some features are not contributing much to the model's performance.

7. **Domain Knowledge:** Consult with domain experts to gain insights into the factors that contribute to customer churn. Their expertise can guide feature selection and engineering.

8. **Regularization:** Experiment with constraining the trees in our model to reduce overfitting, for example by limiting `max_depth` or raising `min_samples_leaf`. Random Forests are less prone to overfitting than single decision trees, but it can still occur in some cases.

9. **Evaluate Other Metrics:** Besides accuracy, consider other evaluation metrics like precision, recall, and the confusion matrix to get a more comprehensive view of our model's performance, especially if class imbalance is a concern.
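
Several of the suggestions above (hyperparameter tuning, cross-validation, tree constraints, and metrics beyond accuracy) can be combined in a few lines. The following is a minimal sketch on synthetic data, not a tuned setup for the churn problem. The parameter grid, the F1 scoring choice, and the data itself are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the preprocessed churn features.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Small illustrative grid over two tree-constraining hyperparameters.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

# 5-fold cross-validated grid search, scored by F1 on the positive class.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X_tr, y_tr)

# Evaluate the refitted best model on the held-out test set.
y_hat = search.predict(X_te)
cm = confusion_matrix(y_te, y_hat)

print(search.best_params_)
print(cm)
print(classification_report(y_te, y_hat))
```

With the real pipeline, the same search could wrap `pipeline` itself, using step-prefixed parameter names such as `classifier__max_depth`.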

### 2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [18]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

In [19]:
iris = load_iris()
X, y = iris.data, iris.target

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
lr_classifier = LogisticRegression(random_state=0)

In [23]:
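# Hard voting predicts the class chosen by the majority of estimators.
# With only two estimators, any disagreement is a tie, which scikit-learn
# resolves in favour of the class that sorts first.
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_classifier), ('lr', lr_classifier)],
    voting='hard'
)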

In [24]:
pipeline = Pipeline([
    ('voting', voting_classifier)
])

In [25]:
pipeline.fit(X_train, y_train)

In [26]:
y_pred = pipeline.predict(X_test)

In [27]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 1.0
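
A perfect score on a 30-sample test set says little about how the ensemble compares with its members. One way to get a steadier estimate is 5-fold cross-validation over the whole iris dataset, sketched below. The `max_iter=1000` setting and the variable names are assumptions added here, not part of the cells above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    # max_iter raised (an assumption) so the solver has room to converge.
    "lr": LogisticRegression(max_iter=1000, random_state=0),
}
models["voting"] = VotingClassifier(
    estimators=[("rf", models["rf"]), ("lr", models["lr"])],
    voting="hard",
)

# Mean 5-fold cross-validated accuracy for each model.
cv_means = {
    name: cross_val_score(model, X_iris, y_iris, cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_means.items():
    print(f"{name}: {score:.3f}")
```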
