# **Mobile Price Data Analysis and Modelling**

### In this kernel, I explore which factors affect phone prices using the mobile price dataset. At the end of the kernel, I use K-Nearest Neighbors (KNN) to predict the price range of each phone from its other features. If you have any suggestions, advice, or corrections, please don't hesitate to share them.

    
<center><img src="https://assets.newatlas.com/dims4/default/91d6f6a/2147483647/strip/true/crop/3000x2000+0+125/resize/1200x800!/quality/90/?url=http%3A%2F%2Fnewatlas-brightspot.s3.amazonaws.com%2F81%2F53%2Fbb28f58a4b3fbf2210443d6157c7%2F01comparisonhero.jpg"></center>

# Introduction

Bob has started his own mobile company. He wants to give a tough fight to big companies like Apple, Samsung, etc.

He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market, you cannot simply assume things. To solve this problem, he collects sales data on mobile phones from various companies.

Bob wants to find some relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not so good at machine learning, so he needs your help to solve this problem.

In this problem, we need to predict the price range of mobile phones.

# Table of contents:

* [1. Import libraries](#1)
* [2. Basic Data Analysis](#2)
* [3. Exploratory Data Analysis and Visualization](#3)
* [4. Modelling](#4)

<a id="1"></a>
# Import libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import warnings

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

<a id="2"></a>
# Basic Data Analysis
In this section we take a quick look at the data.

In [None]:
df = pd.read_csv("../input/mobile-price-classification/train.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
x = df.columns.tolist()

fig = go.Figure(go.Bar(x = x, y = df.count().tolist(), name='non-NaN'))
fig.add_trace(go.Bar(x = x, y = df.isnull().sum(axis = 0).tolist(), name='NaN'))

fig.update_layout(barmode='stack', title_text="NaN Value Check Bar Chart", uniformtext=dict(mode="hide", minsize=10), xaxis_title = "Columns", yaxis_title = "Number of samples")

fig.show()

Here we check for NaN values in the dataset. As the chart shows, there are no NaN values in any column.
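The same check can also be done numerically, without a chart. Below is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
toy = pd.DataFrame({
    "ram": [512, 2048, np.nan, 4096],
    "battery_power": [800, 1500, 1200, 1900],
})

# Number of missing values in each column
nan_counts = toy.isnull().sum()
print(nan_counts)

# Single flag: True if at least one value in the frame is missing
has_nan = bool(toy.isnull().values.any())
print(has_nan)
```

On the real data, replacing `toy` with `df` should give a zero count for every column if the chart above is right.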

In [None]:
# Count samples per price range, ordered by class label
counts = df['price_range'].value_counts().sort_index()

fig = go.Figure(data=[go.Bar(
            x = counts.index, y = counts.values,
            text = counts.values,
            textposition = 'auto',
        )])

fig.update_layout(
    autosize=False,
    width=500,
    height=500,
    xaxis_title = "Price range",
    yaxis_title = "Number of samples"
)

fig.show()

<a id="3"></a>
# Exploratory Data Analysis and Visualization

In [None]:
corr = df.corr()
# Create a single figure with the desired size (avoids an extra empty figure)
plt.figure(figsize=(12,8))
r = sns.heatmap(corr)
r.set_title("Correlation Heatmap")

Here you can see that price_range has a strong positive correlation with ram. In addition, three_g and four_g are positively correlated, as are pc (primary camera megapixels) and fc (front camera megapixels). px_width and px_height, and sc_w and sc_h, also show a high positive correlation.

In [None]:
corr.sort_values(by=['price_range'], ascending = False)['price_range']

From this ranking, ram appears to be the most significant factor for the price range, because it has the highest correlation with price_range.
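Since price_range is an ordinal class label rather than a continuous value, a rank-based correlation can be a useful cross-check on the ranking. The sketch below uses randomly generated data (an assumption, not the real dataset) in which the target is derived from `ram`, and ranks the features by Spearman correlation. This is a swapped-in variation: the notebook itself uses the default Pearson correlation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (assumption: values are randomly generated)
rng = np.random.default_rng(0)
ram = rng.uniform(256, 4000, size=200)
toy = pd.DataFrame({
    "ram": ram,
    "talk_time": rng.integers(2, 21, size=200),        # unrelated to the target here
    "price_range": pd.cut(ram, bins=4, labels=False),  # 4 ordinal classes derived from ram
})

# Spearman works on ranks, which suits an ordinal target such as price_range
ranking = (
    toy.corr(method="spearman")["price_range"]
    .drop("price_range")
    .sort_values(ascending=False)
)
print(ranking)
```

On the real data, the equivalent call would be `df.corr(method="spearman")`.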

In [None]:
fig = px.scatter(y = df.ram, x = df["price_range"], title='Effect of ram on price')

fig.update_layout(
    autosize=False,
    width=500,
    height=500,
    yaxis_title = "Ram",
    xaxis_title = "Price Range"
)

fig.show()

In [None]:

fig = make_subplots(rows=2, cols=1)

fig.append_trace(
    go.Scatter(
        y = df["battery_power"],
        x = df["ram"],
        mode='markers',
        marker=dict(
            size=16,
            color= df["price_range"],
            showscale=True
        )
    ), row=1, col=1)

fig.append_trace(
    go.Scatter(
        y = df["talk_time"],
        x = df["ram"],
        mode='markers',
        marker=dict(
            size=16,
            color= df["price_range"],
            showscale=True
        )
    ), row=2, col=1)



fig.update_xaxes(title_text="Ram", row=1, col=1)
fig.update_yaxes(title_text="Battery power", row=1, col=1)

fig.update_xaxes(title_text="Ram", row=2, col=1)
fig.update_yaxes(title_text="Talk time", row=2, col=1)


fig.update_layout(title_text="Relationship of battery power and talk time with ram",showlegend=False, height=900)
fig.show()


In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x = df["fc"], name='Front camera'))
fig.add_trace(go.Histogram(x = df["pc"], name='Primary camera'))

fig.update_traces(opacity=0.65)

fig.update_layout(
    barmode='overlay',
    title_text='Distribution of front and primary camera megapixels',
    xaxis_title_text='Megapixel', 
    yaxis_title_text='Count')
fig.show()

In [None]:
fig = px.box(df, x = "three_g", y = "ram", color="price_range" , points = "all")

# Axis titles match the plotted columns (x: three_g, y: ram)
fig.update_layout(
    title_text='Ram values according to 3G and price range',
    xaxis_title_text='3G support (0 = no, 1 = yes)',
    yaxis_title_text='Ram'
)

fig.show()

In [None]:
fig = px.box(df, x = "four_g", y = "ram", color="price_range" , points = "all")

# Axis titles match the plotted columns (x: four_g, y: ram)
fig.update_layout(
    title_text='Ram values according to 4G and price range',
    xaxis_title_text='4G support (0 = no, 1 = yes)',
    yaxis_title_text='Ram'
)



fig.show()

<a id="4"></a>
# Modelling

In [None]:
X = df.drop('price_range',axis=1)
Y = df['price_range']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=101)

In [None]:
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train,y_train)

In [None]:
knn.score(X_test,y_test)

In [None]:
neighbors = list(range(1,50,2))
cv_scores = []

for K in neighbors:
    knn = KNeighborsClassifier(n_neighbors = K)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring="accuracy")
    cv_scores.append(scores.mean())

# Convert accuracy to misclassification error
errors = [1-x for x in cv_scores]

# Determine the best k (lowest misclassification error)
optimal_k = neighbors[errors.index(min(errors))]
print("The optimal no. of neighbors is {}".format(optimal_k))

Here I search for the optimal number of neighbors for KNN, using 10-fold cross-validation on the training set and picking the K with the highest mean accuracy.
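The same search can also be written with scikit-learn's `GridSearchCV`. KNN is distance-based, so features with large ranges can dominate the distance; the sketch below therefore adds a `StandardScaler` step in a pipeline. This is a variation, not what the notebook above does, and it runs on synthetic data from `make_classification` (an assumption) rather than on the phone dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=4, random_state=0,
)

# Scaling is fitted inside each CV fold, so the held-out fold never leaks into it
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Same candidate values of K as in the loop above: odd numbers from 1 to 49
param_grid = {"knn__n_neighbors": list(range(1, 50, 2))}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
best_score = search.best_score_
print("Best K:", best_k, "mean CV accuracy:", round(best_score, 3))
```

After fitting, `search.best_estimator_` is the pipeline refitted on all the data passed to `fit`, so it can be used directly for prediction.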

In [None]:
df_accuracy = pd.DataFrame({"K": neighbors, "Accuracy": cv_scores})
fig = px.bar(df_accuracy, x='K', y='Accuracy')

fig.update_yaxes(range = [0.8,1])
fig.show()

In [None]:
# Refit with the K selected by cross-validation instead of a hard-coded value
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train,y_train)

In [None]:
pred = knn.predict(X_test)

In [None]:
print(classification_report(y_test,pred))

In [None]:
print(confusion_matrix(y_test,pred))

In [None]:
fig = px.imshow(confusion_matrix(y_test,pred), labels=dict(color="Count"),)
fig.update_layout(
    autosize=False,
    width=500,
    height=500
)
fig.show()

In [None]:
df_test=pd.read_csv('../input/mobile-price-classification/test.csv')

In [None]:
df_test.head()

In [None]:
df_test = df_test.drop('id',axis=1)

In [None]:
df_test_pred = knn.predict(df_test)

In [None]:
df_test['price_range'] = df_test_pred

In [None]:
df_test

**Thank you!** If you have any suggestions, advice, or feedback, I would really appreciate hearing them.