### Outline

- <b> Read Dataset </b>
- <b> Data Cleaning </b>
    - Outliers
    - Uncorrelated columns
    - Filling null values
    
- <b> Data Visualization </b>

- <b> Model Preparation </b>
    - Encoding categorical variables
    - Normalization 
    - Split Training and testing
    
- <b> Models and Tuning </b>
    - Logistic Regression
    - Naive Bayes
    - K-Nearest Neighbor
    - Decision Tree
    - Random Forest
    - Support Vector Machine
  
- <b> Model Evaluation </b>
    - Confusion Matrix
    - Precision
    - Recall
    - F1 Score


## 1. Read Dataset

In [None]:
import os
import pandas as pd

In [None]:
path = "../input/mobile-price-classification/"

In [None]:
df_train = pd.read_csv(os.path.join(path,'train.csv'))
df_test = pd.read_csv(os.path.join(path,'test.csv'))

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_train.info()

In [None]:
df_train.describe()

In [None]:
df_test.shape

## 2. Data Cleaning
- Outliers
- Uncorrelated columns
- Filling null values

### 2.1 Outliers

In [None]:
df_train.describe()

From the summary statistics, almost all the features fall within plausible ranges.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# displot is a figure-level function, so size it via height/aspect rather than plt.figure
sns.displot(x = df_train['px_height'], kind = 'kde', height = 7, aspect = 1.2);

There do not appear to be any significant outliers.
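
A density plot gives a visual impression only. As a rough numeric cross-check, the sketch below counts values outside the usual 1.5 × IQR fences for each numeric column. The helper name `iqr_outlier_counts` and the toy frame are illustrative assumptions, not part of the original notebook; in practice `df_train` would be passed instead.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the DataFrame columns
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()

# Toy stand-in for df_train (made-up values, for illustration only)
toy = pd.DataFrame({
    "px_height": [100, 120, 130, 140, 150, 160, 5000],
    "ram": [500, 1000, 1500, 2000, 2500, 3000, 3500],
})
outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

A non-zero count is only a prompt to look closer; whether a flagged value is a genuine outlier still depends on what the feature means.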


###  2.2 Uncorrelated Columns


In [None]:
df_train.info()

In [None]:
corr = df_train.corr()

In [None]:
fig, ax = plt.subplots(figsize = (10,10))
sns.heatmap(corr, ax = ax, cmap = 'viridis',linewidth = 0.1);

From the above figure, we can observe that:
- <b> ram </b> has a strong correlation with <b> price_range </b>, which is expected.
- <b> px_width </b> and <b> px_height </b> are also correlated with <b> price_range </b>, somewhat surprisingly.
- <b> battery_power </b> also shows some correlation with <b> price_range </b>.
- However, <b> four_g </b>, <b> touch_screen </b> and <b> int_memory </b> have lower correlation.
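
Reading exact values off a heatmap is hard, so it can help to list the correlations with the target directly. The sketch below ranks features by the absolute value of their correlation with `price_range`. The toy frame is synthetic and only mimics the column names; with the real data, `corr["price_range"]` from the cell above would be used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train: ram is built to track price_range,
# touch_screen is pure noise (assumption for illustration only)
rng = np.random.default_rng(0)
n = 500
price_range = rng.integers(0, 4, size=n)
toy = pd.DataFrame({
    "ram": price_range * 900 + rng.normal(0, 200, size=n),
    "touch_screen": rng.integers(0, 2, size=n),
    "price_range": price_range,
})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["price_range"].drop("price_range")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)
print(target_corr)
```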

In [None]:
df_train['blue'].value_counts()

### 2.3 Filling null values

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

As we can see, there are no <b> NaN </b> values in either dataset.

## 3. Data Visualization


In [None]:
fig, ax = plt.subplots(figsize = (12,10))
sns.heatmap(corr, ax = ax, cmap = 'viridis');


- <b> price_range </b>
- <b> ram </b>
- <b> battery_power </b>
- <b> touch_screen </b>
- <b> four_g </b>
- <b> three_g </b>
- <b> px_height </b>
- <b> px_width </b>
- <b> wifi </b>

The sections below take a closer look at these features.

In [None]:
df_train['price_range'].value_counts()

### Ram

In [None]:
sns.displot( x = 'ram', data = df_train, kind = 'kde', height = 7, aspect = 1);

In [None]:
sns.catplot(y = 'ram',x = 'price_range',data = df_train, height = 7, aspect = 1);

Of course, the lower the RAM, the lower the price; the higher the RAM, the higher the price.

### battery_power

In [None]:
sns.displot(x = 'battery_power', data = df_train, kind = 'kde', height = 7, aspect = 1);

In [None]:
sns.catplot(y = 'battery_power', x = 'price_range', data = df_train, aspect = 1, height = 7);

From the figure, we can say that:
- The <b> blue points, i.e. the lower price range, </b> are concentrated around <b> battery_power 600-1000 </b>.
- Whereas <b> the red points, i.e. the higher price range, </b> are concentrated around <b> battery_power 1400-2000 </b>.
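
A per-class summary table can back up what the scatter suggests. The following is a minimal sketch using a tiny made-up frame; the values are illustrative and not taken from the dataset. With the real data the same `groupby` call would be applied to `df_train`.

```python
import pandas as pd

# Made-up values that only mirror the column names used in the notebook
toy = pd.DataFrame({
    "price_range": [0, 0, 0, 3, 3, 3],
    "battery_power": [600, 800, 1000, 1400, 1700, 2000],
})

# Median, minimum and maximum battery_power for each price range
summary = toy.groupby("price_range")["battery_power"].agg(["median", "min", "max"])
print(summary)
```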

### touch_screen

In [None]:
sns.displot(df_train['touch_screen'], kind = 'kde', aspect = 1, height = 7);

In [None]:
df_train['touch_screen'].value_counts()

In [None]:
sns.catplot(y = 'battery_power', x = 'price_range', hue = 'touch_screen', data = df_train, aspect = 1, height = 7);

In [None]:
sns.catplot(y = 'ram',x = 'price_range', hue = 'touch_screen',data = df_train, height = 7, aspect = 1);

<b> touch_screen </b> is evenly distributed across all price ranges, so it shows no clear pattern here.

### four_g

In [None]:
df_train['four_g'].value_counts()

In [None]:
sns.catplot(y = 'battery_power', x = 'price_range', hue = 'four_g', data = df_train, aspect = 1, height = 7);

In [None]:
sns.catplot(y = 'ram',x = 'price_range', hue = 'four_g',data = df_train, height = 7, aspect = 1);

In [None]:
labels = ["4G-supported", "Not supported"]
values = df_train['four_g'].value_counts().values

In [None]:
fig1, ax1 = plt.subplots()
ax1.pie(values, labels = labels, autopct = '%1.1f%%', shadow = True, startangle = 90)
plt.show()

### three_g

In [None]:
sns.catplot(y = 'ram',x = 'price_range', hue = 'three_g',data = df_train, height = 7, aspect = 1);

In [None]:
sns.catplot(y = 'battery_power', x = 'price_range', hue = 'three_g', data = df_train, aspect = 1, height = 7);

In [None]:
labels3g = ["3G-supported","Not supported"]
values3g = df_train['three_g'].value_counts().values
fig1,ax1 = plt.subplots()
ax1.pie(values3g, labels = labels3g, autopct = '%1.1f%%', shadow = True, startangle = 90)
plt.show()

### px_height and px_width

In [None]:
sns.catplot(x = 'price_range', y = 'px_height', data = df_train, aspect = 1, height = 7);

In [None]:
sns.catplot(x = 'price_range', y = 'px_width', data = df_train, aspect = 1, height = 7);

### fc (front_camera)

In [None]:
ax = plt.figure(figsize = (8,10))
sns.violinplot(x = 'price_range', y = 'fc', data = df_train);

In [None]:
sns.catplot(x = 'price_range', y = 'fc',hue = 'four_g', data = df_train, aspect = 1, height = 7);

### int_memory

In [None]:
sns.catplot(x = 'price_range', y ='int_memory', data = df_train, aspect = 1, height = 7);

### wifi

In [None]:
sns.catplot(y = 'ram',x = 'price_range', hue = 'wifi',data = df_train, height = 7, aspect = 1);

In [None]:
fig = plt.figure(figsize = (10,8))
sns.pointplot(y = "talk_time", x = "price_range", data = df_train);

Since nothing else stands out, we move on to preparing the data for modeling.

## 4. Model Preparation
- Encoding categorical variables
- Normalization
- Split training and testing

### Encoding categorical variables

In [None]:
df_train.info()

As we can see, there are no string values that need to be converted to numeric.
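
Instead of reading the `info()` output by eye, the same check can be done programmatically. This is a small sketch on a toy frame; the `brand` column is hypothetical and exists only to show what a non-numeric column would look like.

```python
import pandas as pd

# Toy frame; "brand" is a hypothetical string column, not part of the dataset
toy = pd.DataFrame({"ram": [512, 1024], "wifi": [0, 1], "brand": ["a", "b"]})

# Columns that are not numeric and would therefore need encoding
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

An empty list for `df_train` would confirm that no encoding step is needed.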

### Normalization

In [None]:
df_train.describe()

Let's compare the algorithms' predictions with and without normalization.

In [None]:
dfx_train = df_train.copy()

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler

In [None]:
std_scaler = StandardScaler()

for column in ['battery_power','int_memory','mobile_wt','px_height','px_width','ram']:
    df_train[column] = std_scaler.fit_transform(df_train[column].values.reshape(-1,1))

In [None]:
df_train.head()
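
The cell above fits the scaler on the whole training frame before it is split, so statistics from rows that later land in the test split influence the scaling. A common alternative is to split first and fit the scaler on the training split only. The sketch below shows that order on synthetic data; the toy frame and the names `X_tr`, `X_te`, `num_cols` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_train with two of the scaled columns
rng = np.random.default_rng(141)
toy = pd.DataFrame({
    "ram": rng.uniform(256, 4000, size=200),
    "battery_power": rng.uniform(500, 2000, size=200),
    "price_range": rng.integers(0, 4, size=200),
})
num_cols = ["ram", "battery_power"]

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop("price_range", axis=1), toy["price_range"],
    test_size=0.30, random_state=141,
)
X_tr, X_te = X_tr.copy(), X_te.copy()

# ... then fit the scaler on the training split and reuse it on the test split
scaler = StandardScaler()
X_tr[num_cols] = scaler.fit_transform(X_tr[num_cols])
X_te[num_cols] = scaler.transform(X_te[num_cols])
```

The same fitted `scaler` would also be the one to apply to any later unseen data, such as `test.csv`, if a normalized model were used for the final prediction.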

### Split training and testing

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
## Split training and testing

X_train, X_test, y_train, y_test = train_test_split(df_train.drop('price_range', axis = 1), df_train['price_range'], test_size = 0.30, random_state = 141)

In [None]:
X2_train,X2_test, y2_train, y2_test = train_test_split(dfx_train.drop('price_range',axis = 1), dfx_train['price_range'], test_size = 0.30, random_state = 141)

## 5. Models and Tuning

Since this is a classification problem, we will test the following algorithms:
- <b> Logistic Regression </b>
- <b> Naive Bayes </b>
- <b> K-Nearest Neighbour </b>
- <b> Decision Tree </b>
- <b> Random Forest </b>
- <b> Support Vector Machine </b>

We will create and train several machine learning models and compare their performance on this dataset for price-range prediction.

We use <b> F1 Score, Precision, Recall, ROC Curve, Confusion Matrix </b> and <b> PR Curve </b> to evaluate our models.
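
Because `price_range` has four classes, precision, recall and F1 need an averaging strategy. The sketch below uses macro averaging (unweighted mean over classes) on a small hand-made label vector; the labels are illustrative, and macro averaging is one reasonable choice here rather than the notebook's stated method.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made true and predicted labels for four price classes (illustrative only)
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_hat = [0, 1, 1, 1, 2, 2, 3, 2]

# Macro averaging: compute the metric per class, then take the plain mean
precision = precision_score(y_true, y_hat, average="macro")
recall = recall_score(y_true, y_hat, average="macro")
f1 = f1_score(y_true, y_hat, average="macro")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(precision, recall, f1)
print(cm)
```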

In [None]:
model_score = pd.DataFrame(columns = ('Accuracy', 'rmse'))

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics 
import math

In [None]:
lrmodel = LogisticRegression()

In [None]:
lrmodel.fit(X_train, y_train)

In [None]:
lrmodel.score(X_test,y_test)

In [None]:
## Without normalization

In [None]:
lrmodel2 = LogisticRegression()

In [None]:
lrmodel2.fit(X2_train, y2_train)

In [None]:
lrmodel2.score(X2_test, y2_test)

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, confusion_matrix

In [None]:
naiveb = GaussianNB()

In [None]:
naiveb.fit(X_train, y_train)

In [None]:
naiveb.score(X_test, y_test)

### K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

In [None]:
knn = KNeighborsClassifier(n_neighbors = 7)
knn2 = KNeighborsClassifier(n_neighbors = 7)

In [None]:
knn.fit(X_train, y_train)

In [None]:
knn.score(X_test, y_test)

In [None]:
## Without normalization
knn2.fit(X2_train, y2_train)

In [None]:
knn2.score(X2_test, y2_test)

In [None]:
### Elbow Method for optimum values of K

In [None]:
error_rate = []

for i in range(1,20):
    knnx = KNeighborsClassifier(n_neighbors = i)
    knnx.fit(X2_train, y2_train)
    pred_i = knnx.predict(X2_test)
    error_rate.append(np.mean(pred_i != y2_test))

In [None]:
plt.figure(figsize = (10,6))
plt.plot(range(1,20), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o',
        markerfacecolor = 'red', markersize = 5);
plt.title("Error Rate vs K Value")
plt.xlabel("K")
plt.ylabel("Error Rate")
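
The loop above chooses K by its error on the held-out test split, which means that split no longer gives an independent estimate of performance. One alternative is to pick K by cross-validation on the training data only. The sketch below does this with `GridSearchCV` on synthetic data from `make_classification`; the data and the 5-fold setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class problem standing in for the training split
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=4, random_state=141,
)

# Try K = 1..19 with 5-fold cross-validation and keep the best by accuracy
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 20))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

With the notebook's data, `X2_train` and `y2_train` would take the place of the synthetic arrays, leaving `X2_test` untouched until the final evaluation.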

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier()

In [None]:
dt = dt.fit(X_train, y_train)

In [None]:
y_pred = dt.predict(X_test)

In [None]:
print("Accuracy", metrics.accuracy_score(y_test, y_pred))

In [None]:
dt.score(X_test, y_test)

In [None]:
## Without normalization

In [None]:
dt.fit(X2_train,y2_train)

In [None]:
dt.score(X2_test, y2_test)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfclf = RandomForestClassifier(n_estimators = 100)


In [None]:
rfclf.fit(X_train, y_train)

In [None]:
rfclf.score(X_test, y_test)

In [None]:
y_pred = rfclf.predict(X_test)

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

In [None]:
# Without normalization

In [None]:
rfclf.fit(X2_train, y2_train)

In [None]:
rfclf.score(X2_test, y2_test)

In [None]:
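# Note: rfclf was refit on the unnormalized data above, so this scores that model on normalized features
rfclf.score(X_test,y_test)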

### Support Vector Machine

In [None]:
from sklearn import svm

In [None]:
clf  = svm.SVC(kernel='linear')


In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
## Without normalization

In [None]:
clf2 = svm.SVC(kernel='linear')

In [None]:
clf2.fit(X2_train, y2_train)

In [None]:
clf2.score(X2_test, y2_test)

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm2 = LinearRegression()

In [None]:
lm.fit(X_train, y_train)

In [None]:
lm.score(X_test, y_test)  # R^2 (coefficient of determination), not classification accuracy

In [None]:
### Without normalization

In [None]:
lm2.fit(X2_train, y2_train)

In [None]:
lm2.score(X2_test, y2_test)
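
Note that `LinearRegression.score` returns R², while the classifiers' `score` returns accuracy, so the two numbers are not on the same scale. To compare like with like, the regression output can be rounded to the nearest class and scored as accuracy. The sketch below shows this on synthetic data; the rounding and clipping step is an assumption about how one might turn the regression into a classifier, not something the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Synthetic data: one feature in [0, 4), label is its integer part (0..3)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 4, size=(400, 1))
y_toy = np.clip(np.floor(X_toy[:, 0]), 0, 3).astype(int)

reg = LinearRegression().fit(X_toy[:300], y_toy[:300])

# R^2 on the held-out rows (what .score reports for a regressor)
r2 = reg.score(X_toy[300:], y_toy[300:])

# Round predictions to the nearest valid class, then measure accuracy
labels = np.clip(np.rint(reg.predict(X_toy[300:])), 0, 3).astype(int)
acc = accuracy_score(y_toy[300:], labels)
print(r2, acc)
```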

### Conclusion: KNN & Linear Regression performed the best

##### Linear Regression

In [None]:
y_pred = lm.predict(X_test)

In [None]:
plt.scatter(y_test, y_pred);

In [None]:
plt.plot(y_test, y_pred);

## Result: KNN

Since KNN performed much better without normalization, we use the unnormalized <b> X2 / y2 train and test splits </b> for everything below.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
pred = knn2.predict(X2_test)

In [None]:
print(classification_report(y2_test, pred))

In [None]:
matrix = confusion_matrix(y2_test, pred)
print(matrix)

In [None]:
plt.figure(figsize = (10,7))
sns.heatmap(matrix, annot = True);
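
The per-class precision and recall in the classification report can be read straight off the confusion matrix: scikit-learn puts true classes on the rows and predicted classes on the columns. The sketch below works this out with NumPy on a made-up 4×4 matrix; `toy_matrix` is illustrative and not the notebook's actual result.

```python
import numpy as np

# Made-up confusion matrix: rows = true class, columns = predicted class
toy_matrix = np.array([
    [5, 1, 0, 0],
    [1, 4, 1, 0],
    [0, 1, 4, 1],
    [0, 0, 1, 5],
])

true_pos = np.diag(toy_matrix)

# Recall: correct predictions divided by all samples of that true class (row sum)
recall = true_pos / toy_matrix.sum(axis=1)

# Precision: correct predictions divided by all predictions of that class (column sum)
precision = true_pos / toy_matrix.sum(axis=0)
print(precision, recall)
```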

## Price Prediction on test.csv Using KNN

In [None]:
df_test.head()

In [None]:
df_test = df_test.drop('id', axis = 1)

In [None]:
df_test.head()

### Model

In [None]:
predicted_price = knn2.predict(df_test)

In [None]:
predicted_price

## Adding Predicted price to test.csv

In [None]:
df_test['price_range'] = predicted_price

In [None]:
df_test