# Mobile Price Classification Using LazyPredict

LazyPredict is a library that trains a large number of models on a given dataset to help determine which ones work best for it.

The goal is to predict the price range of a smartphone based on its specifications.

The specifications span a total of 20 columns, ranging from 3G availability to touch screen support and amount of RAM, so the feature set is quite extensive.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns

In [None]:
train = pd.read_csv("../input/mobile-price-classification/train.csv")
test = pd.read_csv("../input/mobile-price-classification/test.csv")

After loading the data, let's take a look at it.

In [None]:
train.head()

In [None]:
test.head()

## Data Analysis

In [None]:
# check data types
train.info()

In [None]:
# count missing values in each column
train.isna().sum()

In [None]:
# describe the data
train.describe()

# Exploratory Data Analysis

In [None]:
# number of samples for each price range
fig, ax = plt.subplots(figsize = (10, 4))
sns.countplot(x ='price_range', data=train)
plt.xlabel("Class Label")
plt.ylabel("Number of Samples")
plt.show()

The classes are perfectly balanced, as all things should be.
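
The balance can also be checked numerically instead of reading it off the plot. Below is a minimal sketch; `class_balance` is a hypothetical helper and `toy_labels` is made-up data standing in for `train['price_range']`.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and shares, sorted by class label."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# toy stand-in for train['price_range'] (hypothetical values, not the real data)
toy_labels = pd.Series([0, 1, 2, 3] * 5, name="price_range")

balance = class_balance(toy_labels)
print(balance)
```

In the notebook this would be called as `class_balance(train['price_range'])`; equal shares for every class confirm what the count plot shows.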

In [None]:
# find correlation
corr_mat = train.corr()

# each column's correlation with the price range
corr_mat['price_range']

In [None]:
# convert all to positive and sort by value
abs(corr_mat).sort_values(by=['price_range'])['price_range']

We can make a few observations from the output above:
- RAM is the most decisive factor for the price range, with the highest correlation.
- Pixel resolution does matter after all.
- The number of cores barely correlates with the price. This could be because the cores themselves are weak: most midrange phones nowadays have 8 cores, while the Apple A-series SoCs have at most 6 cores and still perform far better.
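
The ranking step behind these observations can be wrapped in a small helper. This is only a sketch on synthetic data: `rank_by_abs_corr` is a hypothetical name, and the toy `ram` and `n_cores` columns are generated here to mimic one strongly related and one unrelated feature.

```python
import numpy as np
import pandas as pd


def rank_by_abs_corr(df: pd.DataFrame, target: str) -> pd.Series:
    """Absolute Pearson correlation of every other column with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


rng = np.random.default_rng(0)
n = 400
price = rng.integers(0, 4, size=n)

# synthetic stand-in for the training data (assumption, not the real dataset)
toy = pd.DataFrame({
    "ram": price * 1000 + rng.normal(0, 300, size=n),  # strongly tied to the label
    "n_cores": rng.integers(1, 9, size=n),             # unrelated to the label
    "price_range": price,
})

ranking = rank_by_abs_corr(toy, "price_range")
print(ranking)
```

Note that Pearson correlation only captures linear relationships, so a feature with a low score here can still be useful to a non-linear model.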

In [None]:
# battery power distribution for each price range
fig, ax = plt.subplots(figsize=(14,10))
sns.boxenplot(x="price_range",y="battery_power", data=train,ax = ax)

In [None]:
# individual correlation graphs

# get all columns and remove price_range
cols = list(train.columns.values)
cols.remove('price_range')

# plot figure: one line plot per feature on a 7x3 grid
fig, axes = plt.subplots(7, 3, figsize=(15, 30))
plt.subplots_adjust(left=0.1, bottom=0.05, top=1.0, wspace=0.3, hspace=0.2)
for ax, col in zip(axes.flat, cols):
    sns.lineplot(ax=ax, x='price_range', y=col, data=train)

In [None]:
# plot full heatmap
figure(figsize=(20, 14))
sns.heatmap(corr_mat, annot = True, fmt='.1g', cmap= 'coolwarm')

# Modeling
Knowing which model to build for a dataset is not an easy task, especially when fewer than half of the columns correlate strongly with the target variable. Building and tuning these models is also time consuming. That is why we will use the LazyPredict library to show us the results of various models without any tuning, and then implement three of the best-performing ones.

In [None]:
# extract target column
target = train['price_range']

# drop target column from dataset
train.drop('price_range', axis=1, inplace=True)

In [None]:
from sklearn.model_selection import train_test_split

# install and import lazypredict
!pip install lazypredict
from lazypredict.Supervised import LazyClassifier

# split the training dataset into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=123)

# make Lazyclassifier model(s)
lazy_clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)

# fit model(s)
models, predictions = lazy_clf.fit(X_train, X_test, y_train, y_test)

In [None]:
models

In [None]:
# plot the F1 score of the top 5 models
top = models[:5]
figure(figsize=(14, 7))
sns.lineplot(x=top.index, y="F1 Score", data=top)

We are not really interested in the predictions dataframe here, because it covers the held-out part of the training dataset, whose labels we already know.

From the table above we can see that the best algorithm for this type of task is logistic regression, followed by the discriminant analysis models and then closely by the gradient boosting models.
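
This ranking comes from a single 70/30 split, so small gaps between models may not be stable. One way to check is k-fold cross-validation. The sketch below runs on synthetic data from `make_classification` (an assumption, not the phone dataset), and the `candidates` dictionary and the `f1_weighted` metric are choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic 4-class problem with 20 features (stand-in for train / target)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=4, random_state=0
)

# hypothetical shortlist of models to compare
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
}

# 5-fold cross-validated weighted F1 for each candidate
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the standard deviation across folds is larger than the gap between two models, their order in the table should not be read too strictly.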

### Implemented models
- Logistic regression
- Linear Discriminant Analysis
- LightGBM classifier

We skip the Quadratic Discriminant Analysis model because it belongs to the same family as Linear Discriminant Analysis and produces similar results, and we want to implement a diverse range of models.
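
One caveat for the logistic regression: as far as I understand, LazyPredict standardizes numeric features before fitting each model, while a plain `LogisticRegression` fitted on the raw columns sees features on very different scales. A minimal sketch of wrapping the model in a scaling pipeline follows. The toy `ram` and `m_dep` columns and the class thresholds are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 800

# two toy features on very different scales (hypothetical values)
ram = rng.uniform(256, 4000, size=n)    # large-scale feature
m_dep = rng.uniform(0.1, 1.0, size=n)   # small-scale feature
X = np.column_stack([ram, m_dep])

# 4 classes driven by ram only; the thresholds are arbitrary
y = np.digitize(ram, [1200, 2100, 3000])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# the scaler is fitted on the training part only, then reused at predict time
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_tr, y_tr)

scaled_acc = scaled_clf.score(X_te, y_te)
print(f"accuracy with scaling: {scaled_acc:.3f}")
```

Because the scaler lives inside the pipeline, calling `predict` on new data applies the same transformation automatically.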

In [None]:
from sklearn.linear_model import LogisticRegression
# Logistic regression
log_clf = LogisticRegression(random_state=0).fit(train, target)

In [None]:
# drop the id column from test so its columns match those of train
test.drop('id', axis=1, inplace=True)

In [None]:
# get predictions on test dataset and convert it to a dataframe
log_preds = pd.DataFrame(log_clf.predict(test), columns = ['log_price_range'])

log_preds.head()

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Linear Discriminant Analysis
lda_clf = LinearDiscriminantAnalysis().fit(train, target)

In [None]:
# get predictions on test dataset and convert it to a dataframe
lda_preds = pd.DataFrame(lda_clf.predict(test), columns = ['lda_price_range'])

lda_preds.head()

In [None]:
from lightgbm import LGBMClassifier
# lightgbm model
lgbm_clf = LGBMClassifier(objective='multiclass', random_state=5).fit(train, target)

In [None]:
# get predictions on test dataset and convert it to a dataframe
lgbm_preds = pd.DataFrame(lgbm_clf.predict(test), columns = ['lgbm_price_range'])

lgbm_preds.head()

### Comparing model results

In [None]:
# create dataframe with 3 columns and index from any of the predicted dataframes
results = pd.DataFrame(index=log_preds.index, columns=['log', 'lda', 'lgbm'])

# add in data from the 3 predicted dfs
results['log'] = log_preds
results['lda'] = lda_preds
results['lgbm'] = lgbm_preds

# show grouped df
results

In [None]:
# count rows where all 3 models agree on the result
equal_rows = 0
for row in results.itertuples(index=False):
    if row.log == row.lda == row.lgbm:
        equal_rows += 1

equal_rows

Out of the 1000 rows, the 3 models agree on 62%. This suggests that any of these 3 algorithms should be an overall good choice for predicting the price range of a smartphone based on its specifications, although agreement between models is not the same as accuracy.
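
The agreement count can also be computed without an explicit loop, and a pairwise breakdown shows which two models disagree most. This is a sketch on a small made-up table: `toy_results` stands in for the `results` dataframe and its values are hypothetical.

```python
from itertools import combinations

import pandas as pd

# toy stand-in for the `results` dataframe (hypothetical predictions)
toy_results = pd.DataFrame({
    "log":  [0, 1, 2, 3, 1],
    "lda":  [0, 1, 2, 2, 1],
    "lgbm": [0, 1, 3, 2, 0],
})

# a row has exactly one unique value when all models predict the same class
all_agree = toy_results.nunique(axis=1) == 1
agreement_rate = all_agree.mean()

# share of rows on which each pair of models agrees
pairwise = {
    (a, b): (toy_results[a] == toy_results[b]).mean()
    for a, b in combinations(toy_results.columns, 2)
}

print(f"all three agree on {agreement_rate:.0%} of rows")
for pair, rate in pairwise.items():
    print(pair, f"{rate:.0%}")
```

On the real data, `results.nunique(axis=1) == 1` would replace the loop, and `.sum()` on it gives the same count as `equal_rows`.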