<img src="the_leagueAI_Logo.png" alt="the_leagueAI_Logo" width="200"/>

## 1. Introduction

The Banana Company Inc. is a new mobile phone company that wants to disrupt the mobile phone business. Banana wants to compete with Apple and enter this competitive market. However, since they are brand new to the game, they are seeking a Machine Learning Engineer (that’s YOU!) to help them estimate the best price at which to sell their mobile phones.

Today Banana Inc. will provide you with a dataset containing smartphone design features and their cost from different phone manufacturers. The goal is to build a model that can estimate the price range of any given phone based on its features. Having a machine learning model that can estimate the price range of any phone will give Banana Inc. a strategic advantage when deciding how to price their own phones.



<img src="phones_image.jpeg" alt="phones_image" width="1000"/>

## The Data
Banana Inc. has been able to gather information on the technical features and cost of several competitors' smartphones. The features gathered for each of these phones are described below:

Features:
   - ***id***: Identifies every single phone record available in our dataset
   - ***battery_power***: Indicates the total energy a battery can store at once (measured in mAh)
   - ***blue***: Indicates whether or not the phone has bluetooth capabilities
   - ***clock_speed***: Indicates the speed at which the microprocessor executes instructions
   - ***dual_sim***: Indicates whether or not the phone has dual sim support
   - ***fc***: Indicates the number of megapixels available in the front camera
   - ***four_g***: Indicates whether or not the phone has 4G capabilities
   - ***int_memory***: Indicates the phone’s internal memory in gigabytes
   - ***m_dep***: Indicates the phone’s depth in cm
   - ***mobile_wt***: Indicates the phone’s weight
   - ***n_cores***: Indicates the number of cores the processor contains
   - ***pc***: Indicates the number of megapixels available in the phone’s primary camera
   - ***px_height***: Indicates the screen’s vertical pixel resolution
   - ***px_width***: Indicates the screen’s horizontal pixel resolution
   - ***ram***: Indicates the phone’s available RAM
   - ***sc_h***: Indicates the phone’s height in centimeters
   - ***sc_w***: Indicates the phone’s width in centimeters
   - ***talk_time***: Indicates the maximum call time that the phone’s battery will last on a single charge
   - ***three_g***: Indicates whether or not the phone has 3G capabilities
   - ***touch_screen***: Indicates whether or not the phone has a touch screen
   - ***wifi***: Indicates whether or not the phone has wi-fi capabilities

Target:
   - ***price_range***: The price category for each phone. The different categories are explained using the table below:
   
| price_range Value | Range Description | Dollar Value Range |
| --- | --- | --- |
| 0 | Budget | <199 |
| 1 | Midrange | 200-599 |
| 2 | Premium | 600-999 |
| 3 | Flagship | >1000 |
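
As a rough sketch of how the table maps a dollar price to a `price_range` value, the snippet below bins a few prices with `pd.cut`. The table leaves the exact edges open (199 vs. 200, and 1000 itself), so the bin edges used here are an assumption, and the `prices` values are made up for illustration rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dollar prices, not taken from the dataset
prices = pd.Series([149, 250, 599, 750, 1200])

# Assumed bin edges: [0, 200), [200, 600), [600, 1000), [1000, inf)
bins = [0, 200, 600, 1000, np.inf]
price_range = pd.cut(prices, bins=bins, labels=[0, 1, 2, 3], right=False).astype(int)

print(price_range.tolist())  # [0, 1, 1, 2, 3]
```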
   
   
   
Notes to Consider:
- The data provided to you is stored in a CSV file named mobile_data_raw.csv
- This data was gathered manually and therefore might contain many issues


## The Objective:

***To build a machine learning model that is capable of predicting the price range of a smartphone when provided the technical specifications (features) of that phone.***

## The Approach:
Remember the general approach to working on a machine learning project:

1. Start off by loading and viewing the dataset. Make sure to get a general understanding of how the data looks (data types, numerical range, etc.).

2. Prepare the data to make sure that you are not missing any values and that your data will be digested by your machine learning model as expected.

3. Build some intuition about your data by exploring the features. Use the context of the problem to understand how your features will ultimately help your ML model make a prediction.

4. Finally, build the machine learning model and test its accuracy.


## Data Load
Load the data from the file provided and inspect it.

In [1]:
# Import pandas
import pandas as pd

# Load dataset
mobiles_df = pd.read_csv('./train_mobile_data.csv')

# Later cells refer to the same DataFrame as `dataset`
dataset = mobiles_df

# Inspect data
mobiles_df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


In [2]:
mobiles_df.value_counts('price_range')

price_range
0    500
1    500
2    500
3    500
dtype: int64

## Data Exploration
Attempt to understand the data with statistical and visualization methods. This step will help you identify patterns and problems in the dataset.

The steps you should consider in this stage include:

- Identify the input (features) and output (target) variables in your data
- Determine the size of the data (data shape)
- Identify the data type of each feature
- Identify the number of missing values in each feature
- Identify categorical vs continuous variables
- Understand the statistical properties of each feature
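
The checklist above can be sketched on a tiny made-up frame. The column names mirror the dataset, but the values and the "exactly two distinct values means a binary flag" rule are assumptions for illustration, not part of the assignment.

```python
import pandas as pd

# Tiny made-up frame that mimics a few columns of the dataset
demo_df = pd.DataFrame({
    "ram": [2549, 2631, 2603, 2769, 1411, 1067],
    "clock_speed": [2.2, 0.5, 0.5, 2.5, 1.2, 0.5],
    "wifi": [1, 0, 0, 0, 0, 1],
    "four_g": [0, 1, 1, 0, 1, 0],
    "price_range": [1, 2, 2, 2, 1, 1],
})

# Inputs (features) vs. output (target)
demo_features = demo_df.drop(columns="price_range")
demo_target = demo_df["price_range"]

# Assumed rule of thumb: exactly two distinct values marks a binary/categorical flag
n_unique = demo_features.nunique()
binary_cols = n_unique[n_unique == 2].index.tolist()
continuous_cols = n_unique[n_unique > 2].index.tolist()

print("shape:", demo_df.shape)
print("dtypes:", demo_df.dtypes.astype(str).to_dict())
print("missing per column:", demo_df.isnull().sum().to_dict())
print("binary:", binary_cols)
print("continuous:", continuous_cols)
```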

In [39]:
# Separate features and target variables
features = dataset.drop('price_range',axis=1)
target = dataset['price_range']

In [35]:
# print our data shape
print('the shape of the feature dataset is ', features.shape)

the shape of the feature dataset is  (2000, 20)


In [17]:
# Code for data exploration
dataset.info()
dataset.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,1238.5185,0.495,1.52225,0.5095,4.3095,0.5215,32.0465,0.50175,140.249,4.5205,...,645.108,1251.5155,2124.213,12.3065,5.767,11.011,0.7615,0.503,0.507,1.5
std,439.418206,0.5001,0.816004,0.500035,4.341444,0.499662,18.145715,0.288416,35.399655,2.287837,...,443.780811,432.199447,1084.732044,4.213245,4.356398,5.463955,0.426273,0.500116,0.500076,1.118314
min,501.0,0.0,0.5,0.0,0.0,0.0,2.0,0.1,80.0,1.0,...,0.0,500.0,256.0,5.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,851.75,0.0,0.7,0.0,1.0,0.0,16.0,0.2,109.0,3.0,...,282.75,874.75,1207.5,9.0,2.0,6.0,1.0,0.0,0.0,0.75
50%,1226.0,0.0,1.5,1.0,3.0,1.0,32.0,0.5,141.0,4.0,...,564.0,1247.0,2146.5,12.0,5.0,11.0,1.0,1.0,1.0,1.5
75%,1615.25,1.0,2.2,1.0,7.0,1.0,48.0,0.8,170.0,7.0,...,947.25,1633.0,3064.5,16.0,9.0,16.0,1.0,1.0,1.0,2.25
max,1998.0,1.0,3.0,1.0,19.0,1.0,64.0,1.0,200.0,8.0,...,1960.0,1998.0,3998.0,19.0,18.0,20.0,1.0,1.0,1.0,3.0


In [40]:
# check for empty values
dataset.isnull().sum()

battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64

## Data Visualization
We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

- Histograms to get an idea of the distributions
- Scatter plots to find some of the correlations between variables

In [None]:
# Code for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: pairwise plots of all features (slow with 21 columns)
#sns.pairplot(dataset,hue='price_range')

# How is price affected by RAM?
sns.jointplot(x='ram',y='price_range',data=dataset,color='red',kind='kde')
plt.show()

# How is price affected by internal memory?
sns.pointplot(y="int_memory", x="price_range", data=dataset)
plt.show()

# Percentage of phones which support 3G
# reindex([1, 0]) fixes the order of the counts so they always match the labels
labels = ["3G-supported",'Not supported']
values = dataset['three_g'].value_counts().reindex([1, 0], fill_value=0).values
fig1, ax1 = plt.subplots()
ax1.pie(values, labels=labels, autopct='%1.1f%%',shadow=True,startangle=90)
plt.show()

# Percentage of phones that support 4G
labels4g = ["4G-supported",'Not supported']
values4g = dataset['four_g'].value_counts().reindex([1, 0], fill_value=0).values
fig1, ax1 = plt.subplots()
ax1.pie(values4g, labels=labels4g, autopct='%1.1f%%',shadow=True,startangle=90)
plt.show()

# How is price affected by battery power?
sns.boxplot(x="price_range", y="battery_power", data=dataset)
plt.show()

# Number of phones vs. megapixels of the front and primary cameras
plt.figure(figsize=(10,6))
dataset['fc'].hist(alpha=0.5,color='blue',label='Front camera')
dataset['pc'].hist(alpha=0.5,color='red',label='Primary camera')
plt.legend()
plt.xlabel('MegaPixels')
plt.show()

# Talk time vs. price range
sns.pointplot(y="talk_time", x="price_range", data=dataset)
plt.show()

## Data Preparation
You must now begin the process of transforming the raw data so that it can be run through your ML model:

- Modify the data types of each feature (if needed)
- Look for missing values, replace or remove
- Modify skewed variables
- Remove outliers
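
A minimal sketch of the last two steps, using a made-up column rather than the real data. The 1.5 × IQR rule and the `log1p` transform are common choices assumed here for illustration; other thresholds or transforms may suit the actual features better.

```python
import numpy as np
import pandas as pd

# Made-up column with one extreme value, for illustration only
s = pd.Series([10, 12, 11, 13, 12, 11, 200], name="example_feature")

# Flag outliers with the 1.5 * IQR rule (an assumed, commonly used threshold)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# Option 1: drop the outlying rows
s_trimmed = s[~is_outlier]

# Option 2: keep every row but compress the long right tail with log1p
s_logged = np.log1p(s)

print("outliers found:", int(is_outlier.sum()))
print("rows kept after trimming:", len(s_trimmed))
```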

In [18]:
# Modify the target so that only two classes are considered

In [None]:
#

## Feature Engineering
You must now begin the process of extracting more information from existing data. You are not adding any new data here, but you are actually making the data you already have more useful.


The steps you should consider in this stage include:

- Developing new features apart from those already generated

- Selecting a set of features to remove

- Creating features using existing data through mathematical operations 

- Applying feature scaling

- Applying label encoding

- Understanding correlation between features and target
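
A few of these steps can be sketched as follows. The rows are the five shown by `head()` earlier; the `px_area` feature is a hypothetical example of combining existing columns, not something the assignment prescribes. In a real pipeline the scaler would be fit on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five rows displayed by head() above, restricted to a few columns
fe_df = pd.DataFrame({
    "px_height": [20, 905, 1263, 1216, 1208],
    "px_width": [756, 1988, 1716, 1786, 1212],
    "ram": [2549, 2631, 2603, 2769, 1411],
    "price_range": [1, 2, 2, 2, 1],
})

# Hypothetical engineered feature: total pixel count of the screen
fe_df["px_area"] = fe_df["px_height"] * fe_df["px_width"]

# Correlation of each feature with the target
corr_with_target = fe_df.corr()["price_range"].drop("price_range")
print(corr_with_target.sort_values(ascending=False))

# Feature scaling: zero mean and unit variance per column
feature_cols = ["px_height", "px_width", "ram", "px_area"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(fe_df[feature_cols]),
    columns=feature_cols,
)
print(scaled.round(2))
```

Tree-based models such as decision trees and random forests are largely insensitive to this kind of scaling, so it matters most if you try distance-based or linear models.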




In [None]:
# Code for feature engineering

In [None]:
# Code for feature engineering

## Build Train and Test
To train a model, we first split the data into three sections: ‘training data’, ‘validation data’, and ‘testing data’.
You train the classifier on the training set, tune its parameters on the validation set, and then test its performance on the unseen test set.

- Note: while training the classifier, only the training and/or validation sets are available. The test set must not be used during training; it is only used when testing the classifier.
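
One way to get all three sections is to call `train_test_split` twice. The sketch below uses synthetic stand-in data, and the 60/20/20 proportions are an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 4 features, 4 balanced classes
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 4))
y_all = np.repeat([0, 1, 2, 3], 25)

# Step 1: hold out 20% as the test set
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=101
)

# Step 2: split the remaining 80% into training (60% overall) and validation (20% overall)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=101
)

print(len(X_fit), len(X_val), len(X_holdout))  # 60 20 20
```

Passing `stratify` keeps the four price ranges equally represented in every split.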

In [21]:
# Import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=101)

## Building the Model

Now that the data has been processed, it is time to determine which model will be used to make our predictions.

Consider the following points before choosing a model:

- The type of prediction this project requires (classification/regression)
- How well you understand the model you want to use
- Previous performance of the model you choose on similar data
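
If you are unsure which classifier to pick, cross-validation gives a quick side-by-side comparison. This is a sketch on synthetic data from `make_classification`, not on the phone dataset, so the scores it prints say nothing about the real problem; the two candidates are simply the models used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_classes=4, random_state=0
)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```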



In [22]:
# Code to build the model
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
dtree.score(X_test,y_test)

0.8242424242424242

In [23]:
feature_names=['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi']

In [24]:
# Tree visualization. Kaggle doesn't support pydotplus, so install pydotplus
# from your system's conda terminal and uncomment the lines below to run them.
# StringIO comes from io because sklearn.externals.six is no longer shipped with scikit-learn.
#
# import pydotplus as pydot
# from IPython.display import Image
# from io import StringIO
# from sklearn import tree
#
# dot_data = StringIO()
# tree.export_graphviz(dtree, out_file=dot_data,feature_names=feature_names)
# graph = pydot.graph_from_dot_data(dot_data.getvalue())
# Image(graph.create_png())

In [25]:
#Another way
# io.StringIO replaces sklearn.externals.six.StringIO, which recent scikit-learn releases no longer ship
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydot
import os
# Machine-specific: adds a local Graphviz install to PATH (adjust or remove for your system)
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,feature_names=feature_names,filled=True)

graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())

In [26]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)

RandomForestClassifier(n_estimators=200)

## Evaluating and Accuracy Metrics
<p>But how well does our model perform? </p>


In [27]:
# Data Accuracy metrics code

In [28]:
rfc.score(X_test,y_test)

0.8712121212121212

In [29]:
pred = rfc.predict(X_test)

In [30]:
from sklearn.metrics import classification_report,confusion_matrix

In [31]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94       158
           1       0.80      0.89      0.84       152
           2       0.90      0.75      0.82       199
           3       0.86      0.93      0.89       151

    accuracy                           0.87       660
   macro avg       0.87      0.88      0.87       660
weighted avg       0.87      0.87      0.87       660



In [32]:
matrix=confusion_matrix(y_test,pred)
print(matrix)

[[150   8   0   0]
 [ 11 136   5   0]
 [  0  27 149  23]
 [  0   0  11 140]]
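
The numbers in the classification report can be recovered directly from this confusion matrix, which is a useful check on how the two relate. The sketch below retypes the matrix printed above; rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = np.array([
    [150,   8,   0,   0],
    [ 11, 136,   5,   0],
    [  0,  27, 149,  23],
    [  0,   0,  11, 140],
])

correct = np.diag(cm)                    # correctly classified phones per class
recall = correct / cm.sum(axis=1)        # share of each true class that was found
precision = correct / cm.sum(axis=0)     # share of each predicted class that was right
accuracy = float(correct.sum() / cm.sum())

print(np.round(recall, 2))     # [0.95 0.89 0.75 0.93]
print(np.round(precision, 2))  # [0.93 0.8  0.9  0.86]
print(round(accuracy, 4))      # 0.8712
```

The low recall for class 2 comes from its 27 phones predicted as class 1 and 23 predicted as class 3: every misclassification lands in a neighbouring price range.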


In [None]:
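import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize = (10,7))
# fmt='d' prints the counts as plain integers
sns.heatmap(matrix,annot=True,fmt='d')
plt.xlabel('Predicted price_range')
plt.ylabel('True price_range')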

In [34]:
target.value_counts()


3    500
2    500
1    500
0    500
Name: price_range, dtype: int64