<h1>Model Selection with Pipeline and GridSearchCV For Beginners

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
%matplotlib inline

In [None]:
train_df = pd.read_csv("../input/mobile-price-classification/train.csv")

In [None]:
train_df.shape

In [None]:
train_df.head().T

In [None]:
train_df.dtypes

In [None]:
train_df.describe(include="all").T

Counting the unique values in each column helps to identify columns with low cardinality.

In [None]:
train_df.nunique()

In [None]:
for col in train_df.columns:
    print("-"*10)
    print(col)
    print("-"*10)
    print(train_df[col].unique())

Storing the label, categorical features and numerical features in separate lists helps in EDA across data types.
There are some columns with binary inputs (0,1) which must be considered categorical.
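
As a rough sketch of the idea above, binary columns can also be found by counting unique values instead of listing column positions by hand. The toy frame and the names `toy_df`, `binary_cols` and `numeric_cols` are made up for illustration, and the sketch assumes that every two-valued column really is a 0/1 flag, which should be checked against the data.

```python
import pandas as pd

# Toy frame standing in for train_df (values are invented for illustration)
toy_df = pd.DataFrame({
    "blue": [0, 1, 1, 0],
    "wifi": [1, 1, 0, 0],
    "ram": [512, 1024, 2048, 4096],
    "price_range": [0, 1, 2, 3],
})
label = "price_range"

# Count distinct values per feature column (label excluded)
n_unique = toy_df.drop(columns=label).nunique()

# Assumption: exactly two distinct values means a binary (categorical) flag
binary_cols = n_unique[n_unique == 2].index.tolist()
numeric_cols = n_unique[n_unique > 2].index.tolist()

print(binary_cols)
print(numeric_cols)
```

Selecting by name or by cardinality keeps working if the column order of the file changes, which positional indices do not.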

In [None]:
label = "price_range"

In [None]:
#storing categorical features
cat_features = train_df.columns[[1,3,5,17,18,19]]
print(cat_features)

In [None]:
#storing numerical features
num_features = train_df.columns[~train_df.columns.isin(cat_features) & (train_df.columns!=label)]
print(num_features)

In [None]:
num_features_with_missing = num_features[train_df[num_features].min()==0]
print(num_features_with_missing)

Not all missing values are represented as NaN. Four numerical columns ('fc', 'pc', 'px_height', 'sc_w') contain a few 0 entries. For "front camera" (fc) and "primary camera" (pc), a 0 can be taken to mean that the mobile has no front/rear camera. But the other two variables, "pixel height" (px_height) and "screen width" (sc_w), can't have 0 as a value. Hence, these zeros must be marked as missing (NaN).

In [None]:
#keeping only 'px_height' and 'sc_w' (the first two entries, 'fc' and 'pc', are skipped)
num_features_with_missing = num_features_with_missing[2:]
print(num_features_with_missing)

In [None]:
#marking the missing values in the above columns
for col in num_features_with_missing:
    train_df.loc[train_df[col]==0,col] = np.nan
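
The zero-as-missing marking can also be written without a loop. This is a minimal sketch on a toy frame; `toy_df` and `impossible_zero_cols` are illustrative names and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame: 0 is plausible for 'fc' but not for 'px_height' or 'sc_w'
toy_df = pd.DataFrame({
    "fc": [0, 5, 2],
    "px_height": [0, 800, 1200],
    "sc_w": [3, 0, 7],
})
impossible_zero_cols = ["px_height", "sc_w"]

# How many zeros each of these columns holds before marking
zero_counts = (toy_df[impossible_zero_cols] == 0).sum()
print(zero_counts)

# Replace zeros with NaN only in the columns where 0 cannot be a real value
toy_df[impossible_zero_cols] = toy_df[impossible_zero_cols].replace(0, np.nan)
print(toy_df.isnull().sum())
```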

In [None]:
#Computing the % of missing values per column
train_df.isnull().mean()*100

We can see that 'px_height' has only 0.1% (2 records) of its values missing. We can either drop these rows or impute them. The 'sc_w' column has 9% missing values, which must be imputed.
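
To illustrate what imputation does here, the sketch below runs `KNNImputer` on a tiny made-up array. A missing entry is filled with the mean of that feature over the nearest rows, where distance is measured on the features that are present. The array and `n_neighbors=2` are chosen only for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: row 0 is missing its third feature
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [8.0, 9.0, 20.0],
])

# Rows 1 and 2 are the two nearest neighbours of row 0,
# so the gap is filled with the mean of their third feature
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed[0])
```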

Let us see if everything is fine with the "front camera (fc)" and "primary camera (pc)" columns.

There are no mobiles that have a front camera but no primary camera, so that's fine.

In [None]:
len(train_df.loc[(train_df["pc"]==0) & (train_df["fc"]!=0)])

There are a few 4G mobiles without a primary camera, which is unusual.

In [None]:
len(train_df.loc[(train_df["four_g"]==1) & (train_df["pc"]==0)])

There are a few mobiles with a touch screen but without a primary camera, which is also odd.

In [None]:
len(train_df.loc[(train_df["touch_screen"]==1) & (train_df["pc"]==0)])

There are also a few mobiles with Wi-Fi enabled but no primary camera.

In [None]:
len(train_df.loc[(train_df["wifi"]==1) & (train_df["pc"]==0)])

<h2>EDA

From the visualizations below, there are a few observations to note:

1. RAM seems to be the most influential of the numerical variables on the target variable.
2. Battery power also looks influential, but there is not much difference in median battery power between price classes 1 and 2. However, price class 3 has the highest median battery power and price class 0 the lowest.
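
Observations like these can also be read from a table of per-class medians. The sketch below is only an illustration on invented numbers. `toy_df`, `medians` and `step_between_classes` are hypothetical names, and the values say nothing about the real dataset.

```python
import pandas as pd

# Invented values, two phones per price class
toy_df = pd.DataFrame({
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
    "ram": [300, 500, 1200, 1400, 2300, 2500, 3400, 3600],
    "battery_power": [600, 800, 1200, 1300, 1250, 1350, 1700, 1900],
})

# Median of each feature within each price class
medians = toy_df.groupby("price_range")[["ram", "battery_power"]].median()
print(medians)

# Change in the median from one price class to the next;
# a small step means the two classes are hard to tell apart on that feature
step_between_classes = medians.diff()
print(step_between_classes)
```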

In [None]:
fig,ax = plt.subplots(7,2,figsize=(13,40))
#one subplot per numerical feature, filled row by row
for axis,feat in zip(ax.flat,num_features):
    sns.boxplot(x=label,y=feat,data=train_df,ax=axis)
    medians = train_df[[label,feat]].groupby(label).median().reset_index()
    sns.lineplot(x=label,y=feat,data=medians,ax=axis,linewidth=5,color="black")
    axis.set_title("price_range vs "+feat)

plt.show()
    

In [None]:
for feat in cat_features:
    cross_tab = pd.crosstab(index=train_df[feat],columns=train_df[label],normalize="columns")*100
    cross_tab.T.plot(kind="barh",stacked=True,figsize=(11,4))
    plt.title("price_range vs "+feat)
    plt.xlabel("% of mobiles")
    plt.show()
    

In [None]:
fig = plt.figure(figsize=(15,15))
sns.heatmap(train_df[num_features].corr(),annot=True,fmt=".2f",mask=np.triu(train_df[num_features].corr()),cbar=False);

In [None]:
X_train,X_test,y_train,y_test = train_test_split(train_df.iloc[:,:-1],train_df.iloc[:,-1],test_size=0.2,random_state=11)

In [None]:
y_train.value_counts()

In [None]:
classifier_pipe = Pipeline(steps=[("knn_imputer",KNNImputer()),("classifier",DecisionTreeClassifier(random_state=11))])


classifier_param_grid = [{
                      "classifier":[DecisionTreeClassifier(random_state=11)],
                      #"knn_imputer__n_neighbors":np.arange(3,22,2), #preprocessing hyperparameter tuning can also be done
                      "classifier__criterion":["gini","entropy"],
                      "classifier__max_depth":np.arange(10,21,2),
                      #"classifier__min_samples_split":np.arange(2,21,3),
                      #"classifier__min_samples_leaf":np.arange(1,10,2)
                     },

                     {
                      "classifier":[RandomForestClassifier(random_state=11)],
                      #"knn_imputer__n_neighbors":np.arange(3,22,2),
                      "classifier__criterion":["gini","entropy"],
                      "classifier__n_estimators":np.arange(50,1200,500),
                      #"classifier__min_samples_split":np.arange(2,21,3),
                      #"classifier__min_samples_leaf":np.arange(1,10,2)
                     }]


grid_cv = GridSearchCV(estimator=classifier_pipe,param_grid=classifier_param_grid,scoring="accuracy",cv=5)

In [None]:
grid_cv.fit(X_train,y_train)
print(f"BEST SCORE: {grid_cv.best_score_}")
final_classifier_1 = grid_cv.best_estimator_
print(f"VALIDATION SCORE: {final_classifier_1.score(X_test,y_test)}")
print(f"\n\nBEST CLASSIFIER: {final_classifier_1}")
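
To see how every candidate in a search like this one compares, and not only the winner, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a DataFrame. The sketch below uses synthetic data from `make_classification` and a deliberately tiny grid. `X_demo`, `demo_pipe`, `demo_grid` and `demo_search` are illustrative names, not objects from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the mobile price dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=11
)

# The "classifier" step is a placeholder that the grid swaps out
demo_pipe = Pipeline(steps=[("classifier", DecisionTreeClassifier(random_state=11))])
demo_grid = [
    {"classifier": [DecisionTreeClassifier(random_state=11)],
     "classifier__max_depth": [3, 5]},
    {"classifier": [RandomForestClassifier(random_state=11)],
     "classifier__n_estimators": [10, 20]},
]

demo_search = GridSearchCV(demo_pipe, demo_grid, scoring="accuracy", cv=3)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between two candidates is larger than the fold-to-fold variation.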

The 'fc' and 'pc' columns don't seem to be reliable, so let's check the model performance without these columns.

In [None]:
grid_cv.fit(X_train.drop(columns=["fc","pc"]),y_train)
print(f"BEST SCORE: {grid_cv.best_score_}")
final_classifier_2 = grid_cv.best_estimator_
print(f'VALIDATION SCORE: {final_classifier_2.score(X_test.drop(columns=["fc","pc"]),y_test)}')
print(f"\n\nBEST CLASSIFIER: {final_classifier_2}")

Let's check the model performance by dropping the 'pc' column

In [None]:
grid_cv.fit(X_train.drop(columns=["pc"]),y_train)
print(f"BEST SCORE: {grid_cv.best_score_}")
final_classifier_3 = grid_cv.best_estimator_
print(f'VALIDATION SCORE: {final_classifier_3.score(X_test.drop(columns=["pc"]),y_test)}')
print(f"\n\nBEST CLASSIFIER: {final_classifier_3}")

Let's check the model performance by dropping the 'fc' column

In [None]:
grid_cv.fit(X_train.drop(columns=["fc"]),y_train)
print(f"BEST SCORE: {grid_cv.best_score_}")
final_classifier_4 = grid_cv.best_estimator_
print(f'VALIDATION SCORE: {final_classifier_4.score(X_test.drop(columns=["fc"]),y_test)}')
print(f"\n\nBEST CLASSIFIER: {final_classifier_4}")

Since removing the 'fc' and 'pc' columns gave the best accuracy, we drop them from further inputs to the model. Hence, final_classifier_2 is considered to be the best model.
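
The four column-subset experiments above can be compared side by side in a single loop. The following is a sketch on synthetic data, with a plain decision tree and `cross_val_score` standing in for the full grid search. The column names `fc` and `pc` are reused only as labels on random features, so the printed scores say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; column names are labels only (assumption)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=11
)
X_demo = pd.DataFrame(X_arr, columns=["a", "b", "c", "fc", "pc"])

# Each entry names an experiment and the columns it leaves out
candidate_drops = {
    "none": [],
    "fc+pc": ["fc", "pc"],
    "pc": ["pc"],
    "fc": ["fc"],
}

scores = {}
for name, cols in candidate_drops.items():
    model = DecisionTreeClassifier(random_state=11)
    fold_scores = cross_val_score(
        model, X_demo.drop(columns=cols), y_demo, cv=5, scoring="accuracy"
    )
    scores[name] = fold_scores.mean()

# Mean cross-validated accuracy per experiment, best first
comparison = pd.Series(scores).sort_values(ascending=False)
print(comparison)
```

Comparing cross-validated scores in this way keeps the held-out split out of the decision about which columns to drop.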

In [None]:
FINAL_MODEL = final_classifier_2

In [None]:
FINAL_MODEL

In [None]:
X_test.drop(columns=["pc","fc"],inplace=True)

In [None]:
FINAL_MODEL.score(X_test,y_test)

In [None]:
pred = FINAL_MODEL.predict(X_test)

In [None]:
prediction_df = pd.DataFrame({"Actual":y_test,"Prediction":pred})

In [None]:
prediction_df.head()

In [None]:
print(classification_report(y_test,pred))

In [None]:
sns.heatmap(confusion_matrix(y_test,pred),annot=True,fmt="d",cbar=False) #fmt="d" shows counts as whole numbers
plt.xlabel("Prediction")
plt.ylabel("Actual");