# Mobile Price Classification
Classify mobile phones into price ranges.

**Best results (Decision Tree): f1_score: 0.834, ROC AUC: 0.9551051666666667**

## Overview 
### 1) Context

### 2) Description

### 3) Used Python Libraries

### 4) Exploratory Data Analysis (EDA) 

### 5) Data Normalization

### 6) Feature Selection with Visualization

### 7) Feature Selection with Hypothesis test 

### 8) Model Building

### 9) Receiver Operating Characteristic Score (ROC AUC) 

### 10) Conclusion 

### 11) Applying Algorithm 






## Context
Bob has started his own mobile company. He wants to give tough competition to big companies such as Apple and Samsung.

He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market, you cannot simply assume things. To solve this problem, he collects sales data for mobile phones from various companies.

Bob wants to find a relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not very good at machine learning, so he needs your help to solve this problem.

In this problem you do not have to predict the actual price, but a price range indicating how high the price is.


![](https://s3b.cashify.in/gpro/uploads/2019/07/09100223/mobile-phone-evolution.jpg)


## Description

1. id : ID
2. battery_power : Total energy a battery can store at one time, measured in mAh
3. blue : Has Bluetooth or not
4. clock_speed : Speed at which the microprocessor executes instructions
5. dual_sim : Has dual SIM support or not
6. fc : Front camera megapixels
7. four_g : Has 4G or not
8. int_memory : Internal memory in gigabytes
9. m_dep : Mobile depth in cm
10. mobile_wt : Weight of the mobile phone
11. n_cores : Number of processor cores
12. pc : Primary camera megapixels
13. px_height : Pixel resolution height
14. px_width : Pixel resolution width
15. ram : Random access memory in megabytes
16. sc_h : Screen height of the mobile in cm
17. sc_w : Screen width of the mobile in cm
18. talk_time : longest time that a single battery charge will last when you are
19. three_g : Has 3G or not
20. touch_screen : Has touch screen or not
21. wifi : Has Wi-Fi or not
22. price_range : This is the target variable, with values 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost).


## Used Python Libraries

In [None]:
from sklearn.metrics import confusion_matrix ,classification_report,precision_score, recall_score ,f1_score, roc_auc_score 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Know Dataset Nature
1. head() : It is used to get the **first 5 rows** of the dataframe.
2. tail() : It is used to get the **last 5 rows** of the dataframe.
3. describe() : It is used to view some **basic statistical details** like percentile, mean, std etc.
4. info() : It is used to print a **concise summary** of a DataFrame, including the **index dtype and column dtypes, non-null values and memory usage.**

In [None]:

#/kaggle/input/mobile-price-classification/train.csv
#/kaggle/input/mobile-price-classification/test.csv

traindata = pd.read_csv('/kaggle/input/mobile-price-classification/train.csv')
testdata = pd.read_csv('/kaggle/input/mobile-price-classification/test.csv')

In [None]:
traindata.head()

In [None]:
traindata.tail()

In [None]:
traindata.describe()

In [None]:
traindata.info()

## Light Data Exploration
### 1) For numeric data 
* Histograms to understand the distributions 
* Correlation plot (heatmap) 
* Pivot table comparing the mean of each numeric variable across price ranges 


### 2) For Categorical Data 
* Bar charts to understand the balance of classes 
* Pivot tables to understand the relationship with price range 

In [None]:
data_num = traindata[['battery_power',  'clock_speed' , 'fc','int_memory','m_dep', 'mobile_wt','n_cores', 'pc',
                      'px_height','px_width','ram', 'sc_h', 'sc_w', 'talk_time']]

data_cat = traindata[['blue','dual_sim', 'four_g','three_g','touch_screen', 'wifi']]

In [None]:
for i in data_num.columns:
    plt.hist(data_num[i])
    plt.title(i)
    plt.show()

In [None]:
sns.heatmap(data_num.corr())

In [None]:
pd.pivot_table(traindata, index='price_range', values=['battery_power',  'clock_speed' , 'fc','int_memory', 'mobile_wt', 
                                                       'pc', 'px_height','px_width','ram', 'sc_h', 'sc_w', 'talk_time'])

In [None]:
for i in data_cat.columns:
    counts = data_cat[i].value_counts()                          # class counts for this column
    sns.barplot(x=counts.index, y=counts.values).set_title(i)    # keyword x/y: recent seaborn rejects positional data
    plt.show()

In [None]:
# fc, px_height, three_g

In [None]:
for i in data_cat:
    print(pd.pivot_table(traindata,index='price_range',columns=i, values='ram'))
    print("=="*20)

In [None]:
# visualize outlier values with box plots
for i in data_num.columns:
    sns.boxplot(x=data_num[i])
    plt.title(i)
    plt.show()

## Data Normalization

1. Interquartile range (IQR): It helps us find **outlier values** in the columns.
2. outlinefree() : A custom function that detects outlier values in a column. Mainly, it is used to **cap outliers** at the lower and upper IQR limits; no rows are removed from the dataset.
3. for-loop: With the help of a for-loop, we apply **outlinefree()** to every numeric column. The box plots afterwards check whether it worked properly.

In [None]:
def outlinefree(dataCol):
    # Cap values outside the 1.5 * IQR limits; the column keeps its length
    Q1,Q3 = np.percentile(dataCol,[25,75])   # 25th and 75th percentiles (no sorting needed)
    IQR = Q3-Q1                              # interquartile range
    LowerRange = Q1-(1.5 * IQR)              # lower limit
    UpperRange = Q3+(1.5 * IQR)              # upper limit

    newlist = []                             # capped values, in the original order
    for value in dataCol.tolist():
        if value > UpperRange:               # above the upper limit
            newlist.append(UpperRange)       # cap at the upper limit
        elif value < LowerRange:             # below the lower limit
            newlist.append(LowerRange)       # cap at the lower limit
        else:
            newlist.append(value)            # keep the value unchanged

    return newlist

In [None]:
for col in data_num.columns:
    new_list = outlinefree(traindata[col])   # capped values for this column
    traindata[col] = new_list                # replace the column with the capped values
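
The same IQR capping can also be written without an explicit loop over values. The sketch below is one possible alternative, not the notebook's own method: `cap_outliers_iqr` is a hypothetical helper name, and the toy column holds made-up values rather than rows from the dataset.

```python
import pandas as pd


def cap_outliers_iqr(column: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR]; no rows are removed."""
    q1, q3 = column.quantile([0.25, 0.75])
    iqr = q3 - q1
    return column.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy column (made-up values) with one obvious outlier
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(toy)
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Here Q1 = 2 and Q3 = 4, so the upper limit is 4 + 1.5 * 2 = 7 and only the value 100 is changed.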

In [None]:
data_final_num = traindata[['battery_power',  'clock_speed' , 'fc','int_memory','m_dep', 'mobile_wt','n_cores', 'pc',
                      'px_height','px_width','ram', 'sc_h', 'sc_w', 'talk_time']]

In [None]:

for i in data_final_num.columns:
    sns.boxplot(x=data_final_num[i])
    plt.title(i)
    plt.show()

## Feature Selection

1. seaborn.pairplot(): It helps to figure out the relations between the features and the label.

In [None]:
sns.pairplot(traindata)

## Feature Selection with hypothesis test
1. Chi-square test: It helps to figure out whether a feature and the label are related, using the threshold **"pvalue <= 0.1"**.

In [None]:
from scipy.stats import chi2_contingency

ct = pd.crosstab(traindata['wifi'],traindata['price_range'])   # contingency table: wifi vs price_range
stat,pvalue,dof,expected_R = chi2_contingency(ct)
print("pvalue : ",pvalue)

if pvalue <= 0.1:
    print("Null hypothesis rejected. wifi and price_range have a relationship")
else:
    print("Failed to reject the null hypothesis. No evidence that wifi and price_range have a relationship")
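
The cell above tests a single column. A minimal sketch of running the same chi-square test over several categorical columns follows. It assumes the same `pvalue <= 0.1` threshold. `chi2_pvalues` is a hypothetical helper, and `demo` is random synthetic data standing in for `traindata`, so its p-values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_pvalues(df, feature_cols, target_col):
    """Return {feature: p-value} from a chi-square test of independence."""
    pvalues = {}
    for col in feature_cols:
        table = pd.crosstab(df[col], df[target_col])
        _, pvalue, _, _ = chi2_contingency(table)
        pvalues[col] = pvalue
    return pvalues


# Synthetic stand-in for traindata (random values, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "wifi": rng.integers(0, 2, size=400),
    "blue": rng.integers(0, 2, size=400),
    "price_range": rng.integers(0, 4, size=400),
})

results = chi2_pvalues(demo, ["wifi", "blue"], "price_range")
for name, p in results.items():
    verdict = "reject H0" if p <= 0.1 else "fail to reject H0"
    print(f"{name}: p-value = {p:.3f} ({verdict})")
```

With the real data, the call would presumably be `chi2_pvalues(traindata, data_cat.columns, 'price_range')`.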

In [None]:
features = traindata.loc[:,["battery_power","int_memory", "ram","sc_w"]].values
label = traindata.iloc[:,-1].values

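## Model Building
Here we train several algorithms and compare them to see which one gives the best result. Note that each model below is evaluated on its own train/test split (a different `random_state`), so the scores are not strictly comparable. The algorithms are listed below.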

1. Logistic Regression (f1 score: 0.782 )
2. naive bayes (f1 score: 0.278)
3. support vector classification (f1 score: 0.266 )
4. **DecisionTreeClassifier (f1 score: 0.834)**
5. RandomForestClassifier (f1 score: 0.262)
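
One way to put the models on an equal footing is to score them all on the same cross-validation folds. The sketch below illustrates that idea only. It uses synthetic four-class data from `make_classification` in place of `features` and `label`, and its hyperparameters are assumptions, so its scores are unrelated to the numbers listed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for (features, label)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Every model is scored on exactly the same five stratified folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="f1_micro")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean micro-F1 = {scores[name]:.3f}")
```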

In [None]:
#------------------------LogisticRegression-----------------------
X_train, X_test, y_train, y_test= train_test_split(features,label, test_size= 0.25, random_state=167)

classimodel= LogisticRegression(solver="liblinear")  
classimodel.fit(X_train, y_train)
trainscore =  classimodel.score(X_train,y_train)
testscore =  classimodel.score(X_test,y_test)  

print("test score: {}".format(testscore),'\n')
y_predlogi =  classimodel.predict(X_test)
print(' f1 score: ',f1_score(y_test, y_predlogi,average='micro'),'\n')
print(confusion_matrix(y_test, y_predlogi))


In [None]:
print(' precision score: ',precision_score(y_test, y_predlogi,average='micro'),'\n')
print(' recall score: ',recall_score(y_test, y_predlogi,average='micro'),'\n')
print(classification_report(y_test, y_predlogi))
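
A note on the metric: for a single-label multiclass problem such as this one, `average='micro'` makes F1, precision, and recall all equal to plain accuracy. The small check below uses made-up labels, not notebook output. `average='macro'` is shown as one alternative that weights each class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for a 4-class problem (6 of 8 predictions are correct)
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_hat = [0, 1, 2, 2, 0, 0, 2, 3]

acc = accuracy_score(y_true, y_hat)
micro = f1_score(y_true, y_hat, average="micro")
macro = f1_score(y_true, y_hat, average="macro")

print("accuracy:", acc)
print("micro F1:", micro)   # same value as accuracy
print("macro F1:", macro)   # unweighted mean of the per-class F1 scores
```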

In [None]:
#------------------------------naive bayes---------------------------
X_train, X_test, y_train, y_test= train_test_split(features,label, test_size= 0.25, random_state=120) 

NBmodel = GaussianNB()  
NBmodel.fit(X_train, y_train) 

trainscore =  NBmodel.score(X_train,y_train)
testscore =  NBmodel.score(X_test,y_test)  

print("test score: {} train score: {}".format(testscore,trainscore),'\n')
y_predNB =  NBmodel.predict(X_test)
print(' f1 score: ',f1_score(y_test, y_predNB,average='micro'),'\n')
print(confusion_matrix(y_test, y_predNB))

In [None]:
print(' precision score: ',precision_score(y_test, y_predNB,average='micro'),'\n')
print(' recall score: ',recall_score(y_test, y_predNB,average='micro'),'\n')
print(classification_report(y_test, y_predNB))

In [None]:
#-------------------------------- support vector classification -------------------------------------  
X_train, X_test, y_train, y_test= train_test_split(features,label, test_size= 0.25, random_state=39) 

svcmodel = SVC(probability=True)  
svcmodel.fit(X_train, y_train) 

trainscore =  svcmodel.score(X_train,y_train)
testscore =  svcmodel.score(X_test,y_test)  

print("test score: {} ".format(testscore),'\n')

y_predsvc =  svcmodel.predict(X_test)
print(' f1 score: ',f1_score(y_test, y_predsvc,average='micro'),'\n')
print(confusion_matrix(y_test, y_predsvc))

In [None]:
print(' precision score: ',precision_score(y_test, y_predsvc,average='micro'),'\n')
print(' recall score: ',recall_score(y_test, y_predsvc,average='micro'),'\n')
print(classification_report(y_test, y_predsvc))

In [None]:
#------------------------Decision Tree-----------------------
X_train, X_test, y_train, y_test= train_test_split(features,label, test_size= 0.25, random_state=194)

DTmodel=  DecisionTreeClassifier(max_depth=4)  
DTmodel.fit(X_train, y_train)
trainscore =  DTmodel.score(X_train,y_train)
testscore =  DTmodel.score(X_test,y_test)
y_predDT =  DTmodel.predict(X_test)
print(' f1 score: ',f1_score(y_test, y_predDT,average='micro'),'\n')
print(confusion_matrix(y_test, y_predDT))

In [None]:
print(' precision score: ',precision_score(y_test, y_predDT,average='micro'),'\n')
print(' recall score: ',recall_score(y_test, y_predDT,average='micro'),'\n')
print(classification_report(y_test, y_predDT))

In [None]:
#------------------------Random Forest-----------------------
X_train, X_test, y_train, y_test= train_test_split(features,label, test_size= 0.25, random_state=39)

RFmodel=  RandomForestClassifier(criterion='entropy',max_depth=4) 
RFmodel.fit(X_train, y_train)
trainscore =  RFmodel.score(X_train,y_train)
testscore =  RFmodel.score(X_test,y_test)  
y_predRF =  RFmodel.predict(X_test)
print(' f1 score: ',f1_score(y_test, y_predRF,average='micro'),'\n')
print(confusion_matrix(y_test, y_predRF))

In [None]:
print(' precision score: ',precision_score(y_test, y_predRF,average='micro'),'\n')
print(' recall score: ',recall_score(y_test, y_predRF,average='micro'),'\n')
print(classification_report(y_test, y_predRF))

## Receiver Operating Characteristic Score (ROC AUC)
Here we compute the one-vs-rest ROC AUC for each of the trained models and compare them. The scores are computed on the full dataset (`features` and `label`), which includes the rows each model was trained on. The results are listed below.

1. Logistic Regression (auc: 0.9362343333333334)
2. naive bayes (auc: 0.9429046666666667)
3. support vector classification (auc: 0.9649543333333334)
4. DecisionTreeClassifier (auc: 0.9551051666666667)
5. **RandomForestClassifier (auc: 0.9655039999999999)**

In [None]:
#-------------------------------------- LogisticRegression -------------------------------------
probabilityValues = classimodel.predict_proba(features)
auc = roc_auc_score(label,probabilityValues,multi_class ='ovr')
print(auc)

In [None]:
#-------------------------------------- naive bayes -------------------------------------
probabilityValues = NBmodel.predict_proba(features)
auc = roc_auc_score(label,probabilityValues,multi_class ='ovr')
print(auc)

In [None]:
#-------------------------------------- support vector classification -------------------------------------
probabilityValues = svcmodel.predict_proba(features)
auc = roc_auc_score(label,probabilityValues,multi_class ='ovr')
print(auc)

In [None]:
#-------------------------------------- Decision Tree -------------------------------------
probabilityValues = DTmodel.predict_proba(features)
auc = roc_auc_score(label,probabilityValues,multi_class ='ovr')
print(auc)

In [None]:
#-------------------------------------- Random Forest -------------------------------------
probabilityValues = RFmodel.predict_proba(features)
auc = roc_auc_score(label,probabilityValues,multi_class ='ovr')
print(auc)
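
Because the cells above score each model on all rows, including those it was trained on, the AUC values may be optimistic. A minimal sketch of scoring on the held-out split only is shown below. It uses synthetic data in place of `features` and `label`, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for (features, label)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# AUC over all rows (includes training rows) vs. AUC over unseen rows only
auc_all = roc_auc_score(y, tree.predict_proba(X), multi_class="ovr")
auc_test = roc_auc_score(y_te, tree.predict_proba(X_te), multi_class="ovr")
print("AUC on all rows :", auc_all)
print("AUC on test rows:", auc_test)
```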

## Conclusion
I choose the **Decision Tree** algorithm to make predictions on the test dataset.

Decision Tree 
1. **f1_score: 0.834**
2. **auc: 0.9551051666666667**

## Applying Algorithm
1. We have to separate the relevant columns from the test dataset and assign them to a new dataset.
2. Now we are ready to apply the decision tree algorithm to the test dataset.
3. Now we have the model-predicted price ranges, and we can assign them to a price_range column in the test dataset.
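
The three steps above can be sketched as a single helper. This is only an illustration under stated assumptions: `add_predicted_price_range` and `fake_phones` are hypothetical names, the data is randomly generated, and the toy target is derived from `ram` alone so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

FEATURE_COLS = ["battery_power", "int_memory", "ram", "sc_w"]


def add_predicted_price_range(model, df, feature_cols=FEATURE_COLS):
    """Return a copy of df with a predicted 'price_range' column."""
    out = df.copy()
    out["price_range"] = model.predict(out[feature_cols].values)
    return out


rng = np.random.default_rng(0)


def fake_phones(n):
    """Random stand-in rows (not the real dataset)."""
    return pd.DataFrame({
        "battery_power": rng.integers(500, 2000, size=n),
        "int_memory": rng.integers(2, 65, size=n),
        "ram": rng.integers(256, 4000, size=n),
        "sc_w": rng.integers(0, 19, size=n),
    })


train = fake_phones(200)
# Toy target: four equal-width ram bins labelled 0..3
train["price_range"] = pd.cut(train["ram"], bins=4, labels=False)

model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(train[FEATURE_COLS].values, train["price_range"].values)

test = fake_phones(5)
scored = add_predicted_price_range(model, test)
print(scored)
```

Returning a copy leaves the input frame untouched, which makes it safe to re-run the cell.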

In [None]:
finaltestdata = testdata.loc[:,["battery_power","int_memory", "ram","sc_w"]]

In [None]:
predicted_price = DTmodel.predict(finaltestdata.values)   # .values: the model was fitted on NumPy arrays

In [None]:
predicted_price

In [None]:
testdata['price_range']=predicted_price

In [None]:
testdata.head()