
## Introduction

Bob has started his own mobile phone company. He wants to put up a tough fight against big companies such as Apple and Samsung.

He does not know how to estimate the price of the mobiles his company makes. In this competitive mobile phone market, you cannot simply assume things. To solve this problem, he collects sales data for mobile phones from various companies.

Bob wants to find a relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not so good at machine learning, so he needs your help to solve this problem.

In this problem we need to predict the price range of a mobile phone.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import Libraries

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns 
import numpy as np

from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Basic Data Analysis

In [None]:
# Reading Train and Test data in Data Frame
df_train_og = pd.read_csv("/kaggle/input/mobile-price-classification/train.csv")
df_test_og = pd.read_csv("/kaggle/input/mobile-price-classification/test.csv")

### Column Name and Descriptions

1. battery_power = Total energy the battery can store, in mAh
2. blue = Bluetooth available or not 
3. clock_speed = Microprocessor Speed
4. dual_sim = Has Dual Sim Card or Not
5. fc = Front Camera Mega Pixels
6. four_g = 4G or Not
7. int_memory  = Internal Memory in GB
8. m_dep = Mobile Depth
9. mobile_wt = Weight of mobile phone
10. n_cores = Number of cores of processor
11. pc = Primary Camera mega pixels
12. px_height = Pixel Resolution Height
13. px_width = Pixel Resolution Width
14. ram = Random Access Memory in Megabytes
15. sc_height = Screen Height of mobile in cm
16. sc_width = Screen Width of mobile in cm
17. talk_time = Longest time that a single battery charge will last during calls
18. three_g = Has 3G or not
19. touch_screen = Has touch screen or not
20. wifi = Has wifi or not
21. price_range = 0: low cost, 1: medium cost, 2: high cost, 3: very high cost

In [None]:
# Displaying first 5 rows of train data
df_train_og.head()

In [None]:
# Displaying first 5 rows of test data
df_test_og.head()

In [None]:
# Number of values in train and test data frames
print("Number of Train Data is {}.".format(df_train_og.shape[0]))
print("Number of Test Data is {}.".format(df_test_og.shape[0]))

In [None]:
# Dropping id column in test data
df_test_og = df_test_og.drop(["id"],axis=1)

In [None]:
# Checking null values in train dataset
df_train_og.isnull().sum()

In [None]:
# Checking null values in test dataset
df_test_og.isnull().sum()

### No missing data in training or test data

### price_range is the dependent variable and the rest of the features are independent variables.

## Exploratory Data Analysis and Visualization

In [None]:
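# Pairwise Pearson correlation between all columns, including price_range
corr = df_train_og.corr()
# One figure, sized so that all column labels stay readable
plt.figure(figsize=(12,8))
r = sns.heatmap(corr)
r.set_title("Correlation Heatmap")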

In [None]:
# Sorting from highest to lowest correlated columns
corr.sort_values(by=['price_range'], ascending = False)['price_range']

### From this data we can say that the RAM of the phone is the most significant factor in price range because it has the highest correlation with it.
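
The ranking step above can be reproduced in isolation. The following is a minimal sketch on synthetic data, not the real training set: the `demo` frame and the way its `price_range` is built from `ram` are assumptions made only so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (assumption: not the Kaggle file)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "ram": rng.integers(256, 4000, size=n),
    "battery_power": rng.integers(500, 2000, size=n),
    "n_cores": rng.integers(1, 9, size=n),
})

# Build a 4-level target that depends mostly on ram
score = demo["ram"] + 0.2 * demo["battery_power"]
demo["price_range"] = pd.qcut(score, q=4, labels=False)

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = demo.corr()["price_range"].drop("price_range")
target_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a plain descending sort would push to the bottom.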


## Modelling

In [None]:
# Creating X and y where X has all independent variables and y is the dependent variable
X = df_train_og.drop("price_range",axis=1)
y=df_train_og["price_range"]

In [None]:
# Creating a 80-20 train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2,random_state = 10)

### Logistic Regression without Standardizing or Scaling

In [None]:
def LogisticReg(XTrain,yTrain,XTest,yTest):
    logmodel=LogisticRegression()
    # Fit on the training split only
    logmodel.fit(XTrain,yTrain)
    # Mean accuracy on the held-out split, as a fraction
    print(logmodel.score(XTest,yTest))
    # Same accuracy from explicit predictions, shown as a percentage
    prediction_lr=logmodel.predict(XTest)
    print('The accuracy of the Logistic Regression is',round(accuracy_score(yTest,prediction_lr)*100,2))

In [None]:
LogisticReg(X_train,y_train,X_test,y_test)

#### Logistic Regression has achieved an accuracy of 64%

### K-Nearest Neighbor Technique

In [None]:
# KNN Model with neighbors = 10
def KNN(XTrain,yTrain,XTest,yTest):
    knn = KNeighborsClassifier(n_neighbors=10)
    # Fitting train data on the KNN model
    knn.fit(XTrain,yTrain)
    # Evaluating KNN
    print(knn.score(XTest,yTest))

In [None]:
KNN(X_train,y_train,X_test,y_test)

#### KNN has achieved an accuracy of 90.75%
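
The value `n_neighbors=10` is fixed by hand in the cell above. One common way to check that choice is cross-validation with `cross_val_score`, which this notebook imports. The sketch below is illustrative only: it runs on synthetic data from `make_classification`, and the candidate values of `k` are assumptions, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class problem standing in for the mobile price data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=10,
)

# Mean 5-fold accuracy for each candidate number of neighbors
cv_scores = {}
for k in (1, 5, 10, 15, 20):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="accuracy").mean()

# Candidate with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```

On the real data the same loop would take `X_train` and `y_train` in place of the synthetic arrays, which keeps `X_test` untouched for the final evaluation.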

### Standardizing the data before Logistic Regression

In [None]:
from tqdm.auto import tqdm
categorical_columns = []
numerical_columns = []
for c in tqdm(df_train_og.columns,total=len(df_train_og.columns)):
    if(len(df_train_og[c].value_counts())<=10):
        categorical_columns.append(c)
    else:
        numerical_columns.append(c)
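
The loop above treats any column with at most 10 distinct values as categorical. The same heuristic can be sketched without a loop using `DataFrame.nunique`. The small `demo` frame below is made up for illustration, and the threshold of 10 is carried over from the cell above.

```python
import pandas as pd

# Tiny made-up frame: one continuous-looking column and two low-cardinality ones
demo = pd.DataFrame({
    "ram": [512, 1024, 2048, 3072, 4096, 768, 1536, 2560, 3584, 640, 896, 1280],
    "dual_sim": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "n_cores": [1, 2, 4, 8, 2, 4, 6, 8, 1, 2, 4, 8],
})

# Number of distinct values per column
n_unique = demo.nunique()

# Same rule as the loop: at most 10 distinct values means categorical
categorical = n_unique[n_unique <= 10].index.tolist()
numerical = n_unique[n_unique > 10].index.tolist()
print(categorical, numerical)
```

Note that a count-based rule like this also classifies integer features such as `n_cores` as categorical, so they end up one-hot encoded rather than scaled.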

In [None]:
scaler = StandardScaler()

In [None]:
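def transform_numeric_features(train_data,numerical_columns):
    # Standardize each numeric column to zero mean and unit variance.
    # Note: train_data is modified in place, and the shared scaler is refit for every column.
    for col in numerical_columns:
        values = np.array(train_data[col]).reshape(-1,1)
        train_data[col] = scaler.fit_transform(values).ravel()
    return train_data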

In [None]:
train_data_transformed = transform_numeric_features(df_train_og,numerical_columns)

In [None]:
y_new = train_data_transformed["price_range"]
X_new = train_data_transformed.drop(columns=["price_range"],axis=1)

In [None]:
categorical_columns.remove("price_range")
X_new = pd.get_dummies(X_new,columns=categorical_columns,prefix_sep="_")

In [None]:
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new,y_new, test_size = 0.2,random_state = 10)

In [None]:
LogisticReg(X_train_new,y_train_new,X_test_new,y_test_new)

#### Accuracy increased from roughly 64% to 96% when we standardize the numeric features and one-hot encode the categorical ones.
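
In the cells above the scaler is fit on the full training file before the 80-20 split, so the held-out rows influence the scaling statistics. A hedged alternative is to put the scaler and the model in one scikit-learn `Pipeline`, so that scaling is learned from the training split only. The sketch below uses synthetic data and made-up feature scales, and `max_iter=1000` is an assumption chosen to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data with deliberately mismatched feature scales (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=10,
)
X_demo = X_demo * np.array([1000.0, 500.0, 1.0, 1.0, 10.0, 0.1])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

# The scaler is fit inside the pipeline, on the training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# The test rows are transformed with the training mean and variance, then scored
scaled_acc = pipe.score(X_te, y_te)
print("accuracy with in-pipeline scaling:", round(scaled_acc * 100, 2))
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refit on each training fold.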

### Selecting best features before Logistic Regression

In [None]:
# Selecting top features
topFeatures = SelectKBest(score_func=chi2, k=10)
fit = topFeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
scoreFeatures = pd.concat([dfcolumns,dfscores],axis=1)
scoreFeatures.columns = ['Specs','Score']  #naming the dataframe columns
top_10_imp_feature = scoreFeatures.nlargest(10,'Score')["Specs"].values
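
The chi-squared score used above is only defined for non-negative features, which is why it is computed on the unscaled `X` rather than on standardized columns. The following sketch shows the same scoring step on synthetic data; the column names and the way `ram` is tied to the label are assumptions made so the outcome is predictable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic data: ram grows with the class label, the other two columns are noise
rng = np.random.default_rng(10)
n = 600
y_demo = rng.integers(0, 4, size=n)
X_demo = pd.DataFrame({
    "ram": 500 + 800 * y_demo + rng.integers(0, 400, size=n),
    "noise_a": rng.integers(0, 100, size=n),
    "noise_b": rng.integers(0, 2, size=n),
})

# Score every (non-negative) feature against the label and keep the best one
selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)

# Scores indexed by column name, highest first
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = X_demo.columns[selector.get_support()].tolist()
print(scores)
print("selected:", selected)
```

Building the scores as a `Series` indexed by column name gives the same table as concatenating two data frames, and `get_support()` returns the selected columns directly.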

In [None]:
# Creating X and y from the top 10 most important features
X_top = df_train_og[top_10_imp_feature]
y_top = df_train_og["price_range"]

In [None]:
from tqdm.auto import tqdm
categorical_columns1 = []
numerical_columns1 = []
for c in tqdm(X_top.columns,total=len(X_top.columns)):
    if(len(X_top[c].value_counts())<=10):
        categorical_columns1.append(c)
    else:
        numerical_columns1.append(c)

In [None]:
# Scale a copy and keep the result in X_top, which the next cell one-hot encodes
X_top = transform_numeric_features(X_top.copy(),numerical_columns1)

In [None]:
X_top = pd.get_dummies(X_top,columns=categorical_columns1,prefix_sep="_")

In [None]:
X_train_feat, X_test_feat, y_train_feat, y_test_feat = train_test_split(X_top,y_top, test_size = 0.2,random_state = 10)

In [None]:
LogisticReg(X_train_feat,y_train_feat,X_test_feat,y_test_feat)

#### Accuracy increased from 95.5% to 97.25% when we keep only the top 10 features.
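
A single accuracy figure does not show which price ranges get confused with each other. The notebook imports `confusion_matrix` and `classification_report`, and the sketch below shows one way they could be applied. It runs on synthetic data, so the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the mobile price data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=10,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
labels = [0, 1, 2, 3]
cm = confusion_matrix(y_te, y_pred, labels=labels)
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_te, y_pred, labels=labels, zero_division=0))
```

With ordered classes like price ranges, most errors would be expected next to the diagonal, that is, between adjacent ranges.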