<a href="https://colab.research.google.com/github/saumilhj/projects/blob/main/MobilePriceRange.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**MOBILE PRICE RANGE**

Dataset from Kaggle: https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification

In [1]:
import pandas as pd
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

### Import data

The data was checked for NaN and duplicate values using Excel.

In [2]:
mobile_data = pd.read_csv('train.csv')
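
The Excel check for NaN and duplicate values could also be repeated in pandas so that it lives inside the notebook. The sketch below is only an illustration: `quality_summary` is a hypothetical helper, and `toy` is a made-up frame standing in for `mobile_data`.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows in a DataFrame."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame standing in for mobile_data (values are made up for illustration).
toy = pd.DataFrame({
    "battery_power": [842, 1021, 842, None],
    "ram": [2549, 2631, 2549, 1411],
})

summary = quality_summary(toy)
print(summary)
```

Calling `quality_summary(mobile_data)` on the loaded data should report zero for both counts if the Excel check was complete.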

In [3]:
mobile_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

In [4]:
mobile_data.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


### Exploratory Data Analysis

Summary of all data

In [5]:
mobile_data.describe()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,1238.5185,0.495,1.52225,0.5095,4.3095,0.5215,32.0465,0.50175,140.249,4.5205,...,645.108,1251.5155,2124.213,12.3065,5.767,11.011,0.7615,0.503,0.507,1.5
std,439.418206,0.5001,0.816004,0.500035,4.341444,0.499662,18.145715,0.288416,35.399655,2.287837,...,443.780811,432.199447,1084.732044,4.213245,4.356398,5.463955,0.426273,0.500116,0.500076,1.118314
min,501.0,0.0,0.5,0.0,0.0,0.0,2.0,0.1,80.0,1.0,...,0.0,500.0,256.0,5.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,851.75,0.0,0.7,0.0,1.0,0.0,16.0,0.2,109.0,3.0,...,282.75,874.75,1207.5,9.0,2.0,6.0,1.0,0.0,0.0,0.75
50%,1226.0,0.0,1.5,1.0,3.0,1.0,32.0,0.5,141.0,4.0,...,564.0,1247.0,2146.5,12.0,5.0,11.0,1.0,1.0,1.0,1.5
75%,1615.25,1.0,2.2,1.0,7.0,1.0,48.0,0.8,170.0,7.0,...,947.25,1633.0,3064.5,16.0,9.0,16.0,1.0,1.0,1.0,2.25
max,1998.0,1.0,3.0,1.0,19.0,1.0,64.0,1.0,200.0,8.0,...,1960.0,1998.0,3998.0,19.0,18.0,20.0,1.0,1.0,1.0,3.0


Range of battery power in training dataset

In [6]:
fig = px.histogram(mobile_data, x='battery_power', labels={'battery_power': 'Battery Power'},
                   title='Distribution of mobiles belonging to different ranges of battery power')
fig.show()

Touchscreen and Non-touchscreen phones

In [7]:
# Count phones per touch_screen value; 'blue' is used only as a fully populated column to count
touch_phones = mobile_data.groupby(['touch_screen'], as_index=False)['blue'].count()

In [8]:
fig = px.pie(touch_phones, names='touch_screen', values='blue', labels={'touch_screen': 'Touch Screen', 'blue': 'No of mobiles'},
             title='Percentage of touchscreen and non-touchscreen phones', hole=0.6)
fig.show()

Number of phones under each price range

In [9]:
# Count phones per price_range value; 'blue' is used only as a fully populated column to count
price_range_data = mobile_data.groupby(['price_range'], as_index=False)['blue'].count()

In [10]:
fig = px.bar(price_range_data, x='price_range', y='blue', labels={'price_range': 'Price Range', 'blue': 'No of mobiles'},
             title='Total number of phones under each price range')
fig.show()

Each price range category contains exactly 500 phones, so the four classes are perfectly balanced.
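
The class balance read off the bar chart can also be confirmed numerically. This is a minimal sketch: `class_counts` is a hypothetical helper and `toy_labels` is a made-up label column standing in for `mobile_data['price_range']`.

```python
import pandas as pd


def class_counts(labels: pd.Series) -> pd.Series:
    """Return the number of rows per class, ordered by class label."""
    return labels.value_counts().sort_index()


# Made-up labels: four classes with five rows each.
toy_labels = pd.Series([0, 1, 2, 3] * 5, name="price_range")

counts = class_counts(toy_labels)
# The classes are balanced when every class has the same count.
is_balanced = counts.nunique() == 1
print(counts)
print("balanced:", is_balanced)
```

On the real data, `class_counts(mobile_data['price_range'])` should show 500 for each of the four price ranges.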

### Modelling and predictions

#### Training and validation split

In [11]:
X = mobile_data.drop('price_range', axis=1)
y = mobile_data['price_range']

In [12]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
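
The split above is not stratified, which is why the validation supports in the classification reports are uneven (95, 92, 99 and 114) even though the full dataset is balanced. One option, shown here only as a sketch on synthetic data, is to pass `stratify=y` to `train_test_split`. The names `X_toy`, `y_toy` and the generated values are assumptions for illustration, and using this on the real data would change the split and therefore the scores reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mobile data: 200 rows, 4 balanced classes.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "ram": rng.integers(256, 4000, size=200),
    "battery_power": rng.integers(500, 2000, size=200),
})
y_toy = pd.Series(np.repeat([0, 1, 2, 3], 50), name="price_range")

# stratify=y_toy keeps the class proportions identical in both subsets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

valid_counts = y_va.value_counts().sort_index()
print(valid_counts)
```

With a stratified split, each class contributes the same number of validation rows, which makes per-class scores easier to compare.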

#### Random Forest

In [13]:
def rfmodel_accuracy(n_estimators):
  """Fit a random forest with n_estimators trees and print its validation report."""
  model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
  model.fit(X_train, y_train)
  preds = model.predict(X_valid)
  print(f'No of trees: {n_estimators}')
  print(classification_report(y_valid, preds))

In [14]:
for i in range(100, 250, 25):
  rfmodel_accuracy(i)

No of trees: 100
              precision    recall  f1-score   support

           0       0.94      0.98      0.96        95
           1       0.80      0.79      0.80        92
           2       0.76      0.74      0.75        99
           3       0.91      0.91      0.91       114

    accuracy                           0.86       400
   macro avg       0.85      0.86      0.85       400
weighted avg       0.86      0.86      0.86       400

No of trees: 125
              precision    recall  f1-score   support

           0       0.95      0.99      0.97        95
           1       0.81      0.80      0.81        92
           2       0.78      0.74      0.76        99
           3       0.91      0.93      0.92       114

    accuracy                           0.87       400
   macro avg       0.86      0.87      0.86       400
weighted avg       0.87      0.87      0.87       400

No of trees: 150
              precision    recall  f1-score   support

           0       0.96 

Among the tested values of n_estimators (100 to 225 in steps of 25), the classification reports show that n_estimators = 175 gives the best f1 score on the validation set.

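Price ranges 1 and 2 remain the hardest to classify: in the reports shown above, their f1 scores stay between about 0.75 and 0.81, compared with above 0.90 for ranges 0 and 3.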
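
To see how the weaker classes pull down the summary rows, the macro and weighted averages can be recomputed by hand from the per-class values. The sketch below uses the f1 scores and supports copied from the 100-tree report above. Because those inputs are already rounded to two decimals, the results only match the report to within rounding.

```python
# Per-class f1 scores and supports copied from the 100-tree report.
f1 = {0: 0.96, 1: 0.80, 2: 0.75, 3: 0.91}
support = {0: 95, 1: 92, 2: 99, 3: 114}

total = sum(support.values())

# Macro average: plain mean over classes, every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"validation rows = {total}")
print(f"macro f1        = {macro_f1:.3f}")
print(f"weighted f1     = {weighted_f1:.3f}")
```

The weighted value comes out slightly higher than the macro value because class 3, which has the largest support, is also one of the better classified classes.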

#### XGBoost

In [15]:
def xgmodel_accuracy(n_estimators):
  """Fit XGBoost with n_estimators trees (learning rate 0.05) and print its validation report."""
  xg_model = XGBClassifier(n_estimators=n_estimators, learning_rate=0.05, random_state=0)
  xg_model.fit(X_train, y_train)
  xg_preds = xg_model.predict(X_valid)
  print(f'No of trees: {n_estimators}')
  print(classification_report(y_valid, xg_preds))

In [None]:
for i in range(400, 750, 50):
  xgmodel_accuracy(i)

No of trees: 400
              precision    recall  f1-score   support

           0       0.99      0.97      0.98        95
           1       0.92      0.92      0.92        92
           2       0.88      0.91      0.90        99
           3       0.96      0.95      0.95       114

    accuracy                           0.94       400
   macro avg       0.94      0.94      0.94       400
weighted avg       0.94      0.94      0.94       400

No of trees: 450
              precision    recall  f1-score   support

           0       0.99      0.98      0.98        95
           1       0.93      0.92      0.93        92
           2       0.88      0.90      0.89        99
           3       0.95      0.95      0.95       114

    accuracy                           0.94       400
   macro avg       0.94      0.94      0.94       400
weighted avg       0.94      0.94      0.94       400



As expected, XGBoost outperforms the random forest: in the reports shown, validation accuracy rises from 0.86 to 0.87 for the random forest to 0.94 for XGBoost.

With the learning rate held constant at 0.05, the number of estimators that gives the best f1 score in the sweep above is 600.
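
Both sweeps in this notebook pick the best setting by reading the printed reports. As an alternative, the settings could be ranked by a single number such as the macro f1 score. The following is a sketch under stated assumptions, not the method used above: `sweep_n_estimators` is a hypothetical helper, the data is synthetic (generated with `make_classification`), and a random forest is used in the demo so that the block does not depend on xgboost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def sweep_n_estimators(make_model, candidates, X_tr, y_tr, X_va, y_va):
    """Return {n_estimators: macro f1 on the validation set}."""
    scores = {}
    for n in candidates:
        model = make_model(n)  # build a fresh model for this setting
        model.fit(X_tr, y_tr)
        scores[n] = f1_score(y_va, model.predict(X_va), average="macro")
    return scores


# Synthetic 4-class data standing in for the mobile dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidates = list(range(100, 250, 25))
scores = sweep_n_estimators(
    lambda n: RandomForestClassifier(n_estimators=n, random_state=0),
    candidates, X_tr, y_tr, X_va, y_va,
)

# Pick the setting with the highest macro f1.
best_n = max(scores, key=scores.get)
print(scores)
print("best n_estimators:", best_n)
```

In this notebook the same helper could presumably be called with `lambda n: XGBClassifier(n_estimators=n, learning_rate=0.05, random_state=0)` and `range(400, 750, 50)`, together with the existing `X_train`, `y_train`, `X_valid` and `y_valid`.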