# Mobile Price Classification

## **Main Question:** Which features are most important in predicting a Mobile Phone's Price?


### Objective:
1. Perform **Exploratory Data Analysis** on this dataset in order to extract insights from the data
2. **Predict** the price_range of a mobile phone using its features (the independent variables)

**Data Dictionary**

- ID: ID
- battery_power: Total energy a battery can store at one time, measured in mAh
- blue: Has bluetooth or not
- clock_speed: speed at which microprocessor executes instructions
- dual_sim: Has dual sim support or not
- fc: Front Camera mega pixels
- four_g: Has 4G or not
- int_memory: Internal Memory in Gigabytes
- m_dep: Mobile Depth in cm
- mobile_wt: Weight of mobile phone
- n_cores: Number of cores of processor
- pc: Primary Camera mega pixels
- px_height: Pixel Resolution Height
- px_width: Pixel Resolution Width
- ram: Random Access Memory in Megabytes
- sc_h: Screen Height of mobile in cm
- sc_w: Screen Width of mobile in cm
- talk_time: longest time that a single battery charge will last when you are talking on the phone
- three_g: Has 3G or not
- touch_screen: Has touch screen or not
- wifi: Has wifi or not

Target variable:
- price_range: the target variable, with values 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost)

# Exploratory Data Analysis (EDA)

#### Import necessary modules

In [None]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

**Data Familiarization**

In [None]:
df = pd.read_csv('/kaggle/input/mobile-price-classification/train.csv')

In [None]:
df.head()

In [None]:
print(f'Shape of dataframe: {df.shape}')

In [None]:
df.info()

In [None]:
df.describe()

**Data Cleaning**

In [None]:
df.isna().sum()

0 NaN values in the dataframe

In [None]:
print(df.duplicated().any())

0 duplicated values in the dataframe

In [None]:
df['price_range'].value_counts()

500 mobile phones in each of the following categories: low cost, medium cost, high cost, and very high cost

# **Data Visualization: Analyzing the Relationship Between Variables**

**Correlation Matrix**

In [None]:
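corr = df.corr()

# Blank out the diagonal (each variable's correlation with itself is always 1)
# so it doesn't dominate the heatmap and the rankings below
corr = corr.mask(np.eye(len(corr), dtype=bool))
corr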

Correlation between variables visualized with sns.heatmap

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(corr, annot=True, cmap='Blues')

Display highest correlations between all of our variables 

In [None]:
corr.unstack().sort_values(kind='quicksort', na_position='first').drop_duplicates(keep='last')

There is a very high correlation between "price_range" and "ram". This means that ram should be a key feature when we predict the price range of a mobile phone in the Machine Learning section.

Display highest correlations between price_range and the other features in our dataset

In [None]:
corr.abs()['price_range'].sort_values(ascending=False)

The highest correlations to our target variable (price_range) are:
- ram
- battery_power
- px_width
- px_height
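The ranking above can be wrapped in a small helper. This is a minimal sketch on toy data, not the Kaggle dataset: the function name `top_correlated` and the columns `signal`, `noise`, and `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated(frame: pd.DataFrame, target: str, k: int = 4) -> pd.Series:
    """Return the k features with the largest absolute Pearson correlation to `target`."""
    # Correlation of every column with the target, minus the target itself
    corr_to_target = frame.corr()[target].drop(target)
    # Rank by absolute value, but keep the sign in the returned values
    order = corr_to_target.abs().sort_values(ascending=False).index
    return corr_to_target.loc[order].head(k)

# Toy data (hypothetical): 'signal' drives the target, 'noise' does not
rng = np.random.default_rng(0)
toy = pd.DataFrame({"signal": rng.normal(size=500),
                    "noise": rng.normal(size=500)})
toy["target"] = (toy["signal"] > 0).astype(int)

ranked = top_correlated(toy, "target", k=2)
print(ranked)
```

On the real data, the equivalent call would be `top_correlated(df, 'price_range')`. Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.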

### **Key Variables Visualizations**

In [None]:
sns.displot(df, x='ram')

In [None]:
sns.lmplot(x='ram', y='price_range', data=df, line_kws={'color': 'purple'})
plt.yticks([0, 1, 2, 3])
plt.xlabel('Ram')
plt.ylabel('Price Range')
plt.show()

The plot above shows the high correlation between ram and price range. It shows the general pattern: as ram increases, a mobile's price increases.

In [None]:
sns.boxplot(x='price_range', y='battery_power', data=df)
plt.xlabel('Price Range')
plt.ylabel('Battery Power')
plt.title('Battery Power\'s correlation to Price Range', weight='bold')
plt.show()

In [None]:
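# Map the 0/1 flag to readable labels before counting, so each slice is
# labelled correctly regardless of which value is more frequent
four_g = df['four_g'].map({1: '4G', 0: 'No 4G'}).value_counts()
plt.title('Percentage of Mobiles with 4G', weight='bold')
four_g.plot.pie(autopct="%.1f%%")
plt.show()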

In [None]:
n_cores = df['n_cores'].value_counts()
plt.title('Number of cores in mobile phones\n\n', weight='bold')
n_cores.plot.pie(autopct="%.1f%%", radius=1.5)
plt.show()

Next, we'll use plotly to visualize the 3 variables most highly correlated with price_range

In [None]:
import plotly.express as px
fig = px.scatter_3d(df.head(1000), x='ram', y='battery_power', z='px_width', color='price_range')
fig.show()

Above, we see how ram, battery power, and pixel width all contribute to a mobile phone's price classification

***

# **Machine Learning: Prediction**

We will predict the price_range of a mobile using all features in the dataframe (excluding price_range, of course)

In [None]:
X = df.drop('price_range', axis=1)
y = df['price_range']

**Train-Test-Split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=100)

## Modeling (Part 1)

1. K-Nearest Neighbors
2. Linear Regression
3. RandomForest

You may be thinking, "Why are we using Linear Regression (a regression algorithm) on a classification problem?" Because price_range is coded as the ordered integers 0, 1, 2, and 3, the model can treat it as a numeric target and fit a regression to it. Two caveats apply. First, Linear Regression outputs continuous values rather than class labels, so its predictions have to be rounded to the nearest class before they can be read as a price range. Second, its `.score()` method returns R² rather than accuracy, so its score is not directly comparable with the accuracy scores of the classifiers.
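To put Linear Regression on the same footing as the classifiers, its continuous predictions can be rounded to the nearest class and scored with accuracy. Here is a minimal sketch of that idea on synthetic ordinal data (the data, the weights, and the names `X_demo` and `y_demo` are assumptions for illustration, not part of the notebook's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic ordinal target (an assumption): a noisy linear score cut into 4 quartile bins
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
latent = X_demo @ np.array([2.0, 1.0, 0.5]) + rng.normal(scale=0.3, size=400)
y_demo = np.digitize(latent, np.quantile(latent, [0.25, 0.5, 0.75]))  # labels 0..3

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=100)

reg = LinearRegression().fit(X_tr, y_tr)

# .score() on a regressor is R^2, not accuracy
r2 = reg.score(X_te, y_te)

# Round to the nearest integer and clip into the valid label range 0..3
y_hat = np.clip(np.rint(reg.predict(X_te)), 0, 3).astype(int)
acc = accuracy_score(y_te, y_hat)

print(f"R^2: {r2:.3f}  accuracy after rounding: {acc:.3f}")
```

The two numbers generally differ, which is why the rounded accuracy is the fairer figure to set beside the KNN and Random Forest scores.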

In [None]:
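# Candidate models to compare
models = {'KNN': KNeighborsClassifier(),
         'Linear Regression': LinearRegression(),
         'Random Forest': RandomForestClassifier()}

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training data and return its score on the test data.

    Note: .score() is accuracy for the classifiers but R^2 for LinearRegression.
    """
    np.random.seed(42)  # make the Random Forest results reproducible

    model_scores = {}

    for name, model in models.items():
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)
    return model_scores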

In [None]:
model_scores = fit_and_score(models=models, 
                             X_train=X_train,
                            X_test=X_test,
                            y_train=y_train,
                            y_test=y_test)
model_scores

### Model Comparison

In [None]:
# The score is accuracy for the classifiers and R^2 for Linear Regression
model_comp = pd.DataFrame(model_scores, index=['score'])
model_comp.T.plot.bar(); # .T transposes the DataFrame so that each model becomes a row (one bar per model)

All 3 of our models perform very well!

Next, we're going to tune the hyperparameters of our KNN and Random Forest models. Linear Regression, however, has only a few hyperparameters and they don't affect its overall score, so our final score for the Linear Regression model is the score above.


#### Hyperparameter tuning: KNeighborsClassifier

In [None]:
train_scores = []

test_scores = []

neighbors = range(1, 21)

knn = KNeighborsClassifier()

for i in neighbors:
    knn.set_params(n_neighbors = i)
    
    knn.fit(X_train, y_train)
    
    train_scores.append(knn.score(X_train, y_train))
    
    test_scores.append(knn.score(X_test, y_test))

In [None]:
plt.plot(neighbors, train_scores, label="Train Scores")
plt.plot(neighbors, test_scores, label="Test Scores")
plt.xticks(np.arange(1, 21, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")

Looking at the graph above, n_neighbors = 13 seems to be the best choice. Now, we will apply this!
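One caveat: picking n_neighbors from the test scores means the test set has influenced the model, so the final test score can be a little optimistic. A common alternative is to choose it with cross-validation on the training data only. The sketch below shows that approach on synthetic data (`make_classification` stands in for our dataset, and the names `X_demo`, `y_demo`, and `best_k` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data (an assumption, stands in for X and y)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=100)

neighbors = range(1, 21)

# Mean 5-fold cross-validated accuracy on the TRAINING data for each candidate k
cv_means = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
            for k in neighbors]

best_k = neighbors[int(np.argmax(cv_means))]

# The test set is touched only once, after k has been chosen
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_acc = final_knn.score(X_te, y_te)

print(f"Best k by cross-validation: {best_k}")
print(f"Test accuracy with that k: {test_acc:.3f}")
```

With this setup the test score remains an honest estimate, because it played no part in choosing k.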

In [None]:
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(f'KNN Model Score: {knn.score(X_test, y_test) * 100:.2f}%')

You might be thinking, "Let's call it a day!" Before we do that, we are going to tune our Random Forest model to make sure that we are using the best model!

#### Hyperparameter tuning: Random Forest Model

##### Tuning Random Forest Classifier using RandomizedSearchCV

In [None]:
# Random Forest hyperparameters to search over (parameter names from the sklearn documentation)

rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

In [None]:
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)

rs_rf.fit(X_train, y_train);

rs_rf.best_params_

In [None]:
rs_rf.score(X_test, y_test)

##### Tuning Random Forest Classifier using GridSearchCV

In [None]:
# GridSearchCV tries every combination, so we use a much smaller grid than the
# randomized search above (that grid has 7,200 combinations). These values are illustrative.
rf_grid = {"n_estimators": [100, 500],
           "max_depth": [None, 10],
           "min_samples_split": [2, 6],
           "min_samples_leaf": [1, 3]}

In [None]:
gs_rf = GridSearchCV(RandomForestClassifier(),
                     param_grid=rf_grid,
                     cv=5,
                     verbose=True)

gs_rf.fit(X_train, y_train);

gs_rf.best_params_

In [None]:
gs_rf.score(X_test, y_test)

Well, it looks like even after tuning our Random Forest model, our KNN model still beats it! It was worth the effort, though, to make sure that we are using the best model with the best hyperparameters possible. But before we call it a day, let's try an XGBoost model to see if it outperforms our KNN model.

## Modeling (Part 2)
- XGBoost

In [None]:
# 'mlogloss' is the multiclass log loss, which matches our 4-class target
xgb = XGBClassifier(eval_metric='mlogloss', use_label_encoder=False)

xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
xgb.score(X_test, y_test)

#### Hyperparameter tuning: XGBoost

In [None]:
params_xgb = {'n_estimators': [50,100,250,400,600,800,1000], 
    'learning_rate': [0.2,0.5,0.8,1]}
    
rs_xgb =  RandomizedSearchCV(xgb, param_distributions=params_xgb, cv=5)
rs_xgb.fit(X_train, y_train)
xgb_pred_2 = rs_xgb.predict(X_test)
rs_xgb.score(X_test, y_test)

Even after tuning our XGBoost model's hyperparameters, it still does not perform as well as our KNN model (its score is even lower than our Linear Regression model's, although that score is an R² value rather than an accuracy, so the two are not directly comparable). Having compared all of our candidates, we'll select KNN as our best model. Finally, we'll evaluate it using other metrics!

## Final Model Evaluation

Let's evaluate our KNN model using some other metrics:
- Confusion Matrix
- Classification Report 
    - precision
    - recall
    - f1-score
    - support 
    - accuracy 
    - macro avg
    - weighted avg

### Confusion Matrix

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
import seaborn as sns
sns.set(font_scale=1.5) # Increase font size

def plot_conf_mat(y_test, y_preds):
    """Plot the confusion matrix as a heatmap (rows = true labels, columns = predictions)."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True, # Annotate the boxes
                     cbar=False,
                     fmt='g', # no scientific notation
                     cmap='Blues')

    # sklearn puts true labels on the rows (y-axis) and predictions on the columns (x-axis)
    plt.xlabel("predicted label", weight='bold')
    plt.ylabel("true label", weight='bold')

plot_conf_mat(y_test, y_pred)

The confusion matrix above shows the breakdown of how our model correctly and incorrectly classified mobile phones' prices
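To make the layout of the matrix concrete, here is a small worked example on hand-made labels (the lists `y_true_demo` and `y_pred_demo` are invented for illustration). It shows how per-class recall, precision, and overall accuracy fall out of the matrix, given that sklearn puts true labels on the rows and predictions on the columns:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (hypothetical): two phones per price class
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_demo = [0, 1, 1, 1, 2, 3, 3, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Recall: correct predictions divided by the number of true members of each class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by how often each class was predicted (column sums)
# (this would divide by zero for a class that is never predicted; not the case here)
precision = np.diag(cm) / cm.sum(axis=0)

# Accuracy: the diagonal holds every correct prediction
accuracy = np.trace(cm) / cm.sum()

print(cm)
print("recall:   ", recall)
print("precision:", precision)
print("accuracy: ", accuracy)
```

These are the same per-class quantities that the classification report prints.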

### Classification Report

In [None]:
print(classification_report(y_test, y_pred))

The classification report shows that beyond accuracy, our model performs very well!

#### Cross Validation

In [None]:
# Run 5-fold cross-validation once and reuse the scores
cv_scores = cross_val_score(knn, X, y, cv=5)

print(f'Cross Validation Scores: {cv_scores}')
print(f'Cross Validation Score (Mean): {cv_scores.mean()}')

Looking at the results of the cross validation, we can be reasonably confident that even if we split our data differently, we'd still get similar (if not even stronger) results
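How much the score moves from fold to fold is what backs up that claim, so it helps to report the spread alongside the mean. This is a minimal sketch on synthetic data (`make_classification` and the names `X_demo` and `y_demo` are assumptions standing in for our X and y), using shuffled stratified folds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data (an assumption, stands in for X and y above)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=4, random_state=0)

# Shuffled, stratified folds keep the class balance similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=13), X_demo, y_demo, cv=folds)

print(f"Fold scores: {np.round(cv_scores, 3)}")
print(f"Mean: {cv_scores.mean():.3f}  Std: {cv_scores.std():.3f}")
```

A small standard deviation relative to the mean suggests the result does not depend much on how the data happened to be split.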

# Conclusion

First, to address our main question: the most important features in predicting a mobile phone's price are ram, battery power, and pixel width! We figured this out by using a correlation matrix, specifically by looking at the variables most highly correlated with price range.

In the next part of our notebook, we used machine learning to predict mobile phones' prices using all of the features in our dataset. We saw that the best performing model was KNN -- outperforming Linear Regression, Random Forest, and even XGBoost. We were even able to improve our KNN model's score by tuning its hyperparameter (n_neighbors). Later on, we evaluated our KNN model using other metrics (besides accuracy) and saw that it performed very well by those metrics as well. The fact that KNN was the best performing model suggests that more complicated models are not always the best choice for a given dataset.

## Thank you for reading my notebook. Please upvote it, and leave comments -- it would be greatly appreciated!