<img src="the_leagueAI_Logo.png" alt="the_leagueAI_Logo" width="200"/>

## 1. Introduction

The Banana Company Inc. is a new mobile phone company that wants to disrupt the mobile phone business and compete with Apple. Because the company is brand new to this competitive market, it is seeking a Machine Learning Engineer (that’s YOU!) to help estimate the optimal price at which to sell its mobile phones.

Today Banana Inc. will provide you with a dataset containing smartphone design features and their cost from different phone manufacturers. The goal is to build a model that can estimate the price range of any given phone based on its features. A machine learning model that can do this will give Banana Inc. a strategic advantage when deciding how to price its own phones.



<img src="phones_image.jpeg" alt="phones_image" width="1000"/>

## The Data
Banana Inc. has gathered information on the technical features and cost of several competitors’ smartphones. The features gathered for each of these phones are described below:

Features:
   - ***id***: Identifies every single phone record available in our dataset
   - ***battery_power***: Indicates the total energy a battery can store at once (measured in mAh)
   - ***blue***: Indicates whether or not the phone has Bluetooth capabilities
   - ***clock_speed***: Indicates the speed at which the microprocessor executes instructions
   - ***dual_sim***: Indicates whether or not the phone has dual SIM support
   - ***fc***: Indicates the number of megapixels available in the front camera
   - ***four_g***: Indicates whether or not the phone has 4G capabilities
   - ***int_memory***: Indicates the phone’s internal memory in gigabytes
   - ***m_dep***: Indicates the phone’s depth in cm
   - ***mobile_wt***: Indicates the phone’s weight 
   - ***n_cores***: Indicates the number of cores the processor contains
   - ***pc***: Indicates the number of megapixels available in the phone’s primary camera
   - ***px_height***: Indicates the screen’s vertical pixel resolution
   - ***px_width***: Indicates the screen’s horizontal pixel resolution
   - ***ram***: Indicates the phone’s available RAM
   - ***sc_h***: Indicates the phone’s height in centimeters
   - ***sc_w***: Indicates the phone’s width in centimeters
   - ***talk_time***: Indicates the maximum call time that the phone’s battery will last on a single charge
   - ***three_g***: Indicates whether or not the phone has 3G capabilities
   - ***touch_screen***: Indicates whether or not the phone has a touch screen
   - ***wifi***: Indicates whether or not the phone has Wi-Fi capabilities

Target:
   - ***price_range***: The price category for each phone. The different categories are explained using the table below:
   
| price_range Value | Range Description | Dollar Value Range |
| --- | --- | --- |
| 0 | Budget - Midrange | 0-699 |
| 1 | Premium - Flagship | 700-1300 |
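
To make the encoding in the table concrete, here is a minimal sketch of how a dollar price could be mapped to a `price_range` label. The function name `price_to_range` is hypothetical, and treating every price below 700 as category 0 (including non-integer prices such as 699.50) is an assumption, since the table only lists whole-dollar bounds.

```python
def price_to_range(dollars: float) -> int:
    """Map a dollar price to the price_range label from the table (hypothetical helper)."""
    if not 0 <= dollars <= 1300:
        raise ValueError(f"price {dollars} is outside the 0-1300 range covered by the table")
    # 0 = Budget - Midrange (0-699), 1 = Premium - Flagship (700-1300)
    return 0 if dollars < 700 else 1


# A few illustrative prices on both sides of the boundary
labels = [price_to_range(p) for p in (199, 699, 700, 1299)]
print(labels)  # [0, 0, 1, 1]
```

Because there are only two categories, predicting `price_range` is a binary classification problem.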

   
   
   
Notes to Consider:
- The data provided to you is stored in a CSV file named mobile_data_raw.csv
- This data was gathered manually and might therefore contain many issues


## The Objective:

***To build a machine learning model that is capable of predicting the price range of a smartphone when provided the technical specifications (features) of that phone.***

## The Approach:
Remember the general approach to working on a machine learning project:

 
    1. Start off by loading and viewing the dataset. Make sure to get a general understanding of how the data looks (data types, numerical ranges, etc.).
 
    2. Prepare the data to make sure that you are not missing any values and that your data will be digested by your machine learning model as expected.
    
    3. Build some intuition about your data by exploring the features. Understand how your features will ultimately help your ML model make a prediction using the context of the problem.
 
    4. Finally, build the machine learning model and test its accuracy.


---
## Data Load 
Load the data from the file provided and inspect it.

---

Steps to consider:
- Start by importing the modules and packages that you might use in this project
- When importing the data, consider using a DataFrame to store it

In [None]:
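# Import pandas
import pandas as pd

# Load dataset (named dataset because the cells below refer to it by that name)
dataset = pd.read_csv('./train_mobile_data.csv')

# Inspect data
dataset.head()

In [None]:
# Count how many phones fall into each price range
dataset['price_range'].value_counts()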

---
## Data Exploration & Data Visualization
Attempt to understand the data with statistical and visualization methods. This step will help you identify patterns and problems in the dataset.

---

The steps you should consider in this stage include:

- Identify the input (features) and output (target) variables in your data
- Determine the size of the data (data shape)
- Identify the data type of each feature
- Identify the number of missing values in each feature
- Identify categorical vs. continuous (numerical) variables
- Understand the statistical properties of each feature
- Create histograms to get an idea of each feature's distribution
- Create scatter plots to find correlations between variables
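
As one way to approach the "categorical vs. continuous" step, the sketch below separates columns by how many distinct values they hold. The toy DataFrame, the helper name `split_columns`, and the threshold of two distinct values are all assumptions made for illustration; the real columns come from the CSV file.

```python
import pandas as pd

# Toy data for illustration only; values are made up
toy = pd.DataFrame({
    "blue": [0, 1, 1, 0, 1, 0],
    "wifi": [1, 1, 0, 0, 1, 0],
    "ram": [512, 2048, 3072, 1024, 4096, 768],
    "battery_power": [800, 1500, 1900, 1200, 1800, 950],
    "price_range": [0, 0, 1, 0, 1, 0],
})


def split_columns(df, target="price_range", max_categories=2):
    """Treat feature columns with at most `max_categories` distinct values as categorical."""
    feature_frame = df.drop(columns=[target])
    n_unique = feature_frame.nunique()
    categorical = n_unique[n_unique <= max_categories].index.tolist()
    continuous = n_unique[n_unique > max_categories].index.tolist()
    return categorical, continuous


categorical_cols, continuous_cols = split_columns(toy)
print("categorical:", categorical_cols)
print("continuous:", continuous_cols)

# Missing values per column, as in the checklist above
print(toy.isnull().sum())
```

A unique-value threshold is only a heuristic: a feature such as `n_cores` has few distinct values but is still numeric, so it is worth reviewing the result by hand.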


In [None]:
# Separate features and target variables
features = dataset.drop('price_range',axis=1)
target = dataset['price_range']

In [None]:
# print our data shape
print('the shape of the feature dataset is ', features.shape)

In [None]:
# Code for data exploration
dataset.info()
dataset.describe()

In [None]:
# check for empty values
dataset.isnull().sum()

In [None]:
import plotly.express as px

# First group the dataset by four_g and price_range to gather the total count per combination
group = dataset.groupby(['four_g', 'price_range'])
four_g_price_range = group.size().unstack()


# Create a histogram of ram by price range
hist = px.histogram(dataset, x="ram", opacity=0.7, color="price_range")
hist

In [None]:

categorical_columns = ["blue","dual_sim","four_g","three_g","touch_screen","wifi"]
numerical_columns = ["battery_power","clock_speed","fc","int_memory","m_dep","mobile_wt","n_cores","pc","px_height","px_width","ram","sc_h","sc_w","talk_time"]

In [None]:
for column in categorical_columns:
    print("-----------------------------")
    print("For column {}, the values are: \n{} \n".format(column, dataset[column].value_counts()))
    print("-----------------------------")

In [None]:
for col in numerical_columns:
    if col in ["battery_power","ram","px_height","px_width"]:
        dataset.hist(column=col)

In [None]:
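###### MORE EXAMPLES WE CAN USE ####

# Imports needed by the visualization examples below
import matplotlib.pyplot as plt
import seaborn as sns

# Code for data visualization
#sns.pairplot(dataset,hue='price_range')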

# how is price affected by ram?
sns.jointplot(x='ram',y='price_range',data=dataset,color='red',kind='kde');

# how is price affected by internal memory?
sns.pointplot(y="int_memory", x="price_range", data=dataset)

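# Percentage of phones which support 3G
# Order the counts explicitly (assuming 1 = supported, 0 = not supported) so they line up with the labels
labels = ["3G-supported",'Not supported']
values = dataset['three_g'].value_counts().reindex([1, 0], fill_value=0).values
fig1, ax1 = plt.subplots()
ax1.pie(values, labels=labels, autopct='%1.1f%%',shadow=True,startangle=90)
plt.show()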

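# Percentage of phones that support 4G
# Order the counts explicitly (assuming 1 = supported, 0 = not supported) so they line up with the labels
labels4g = ["4G-supported",'Not supported']
values4g = dataset['four_g'].value_counts().reindex([1, 0], fill_value=0).values
fig1, ax1 = plt.subplots()
ax1.pie(values4g, labels=labels4g, autopct='%1.1f%%',shadow=True,startangle=90)
plt.show()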

# How is price affected by battery power
sns.boxplot(x="price_range", y="battery_power", data=dataset)

# Number of phones vs. megapixels of the front and primary cameras
plt.figure(figsize=(10,6))
dataset['fc'].hist(alpha=0.5,color='blue',label='Front camera')
dataset['pc'].hist(alpha=0.5,color='red',label='Primary camera')
plt.legend()
plt.xlabel('MegaPixels')

# Talk time vs price range
sns.pointplot(y="talk_time", x="price_range", data=dataset)



##########################

----

## Data Preparation
You must now begin the process of transforming the raw data so that it can be run through your ML model.


---

- Modify the data types of each feature (if needed)
- Look for missing values, replace or remove
- Modify skewed variables
- Remove outliers
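
The first three steps above can be sketched as follows. This is a minimal illustration on made-up values, not the actual dataset; the choice of median imputation and a `log1p` transform are assumptions, and other strategies may suit the real data better.

```python
import numpy as np
import pandas as pd

# Toy data with the kinds of problems manual entry tends to produce (illustrative only)
raw = pd.DataFrame({
    "ram": ["512", "2048", "n/a", "4096"],
    "px_height": [20.0, 400.0, 1200.0, None],
    "wifi": ["1", "0", "1", "1"],
})

clean = raw.copy()

# Modify types: coerce text to numbers; entries that cannot be parsed become NaN
clean["ram"] = pd.to_numeric(clean["ram"], errors="coerce")
clean["wifi"] = pd.to_numeric(clean["wifi"], errors="coerce").astype("int64")

# Missing values: fill numeric gaps with the column median
for col in ["ram", "px_height"]:
    clean[col] = clean[col].fillna(clean[col].median())

# Skewed variables: log1p compresses a long right tail and handles zeros
clean["px_height_log"] = np.log1p(clean["px_height"])

print(clean)
print(clean.dtypes)
```

Dropping rows with `dropna()` is the simpler alternative to imputation, at the cost of losing data.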

In [None]:
# Modify types

In [None]:
# Look for missing values, replace or remove

In [None]:
# Remove outliers
# import numpy and seaborn
import numpy as np
import seaborn as sns

# calculate the 25th, 50th, and 75th percentiles of a numerical feature
ram_percentiles = np.percentile(dataset['ram'], [25, 50, 75])

# Print the result
print(ram_percentiles)
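
Percentiles like these are commonly used to remove outliers with the interquartile range (IQR) rule. The sketch below applies it to made-up `mobile_wt` values; the 1.5 multiplier is the conventional default, not something prescribed by this project.

```python
import numpy as np
import pandas as pd

# Toy values; 900 is a deliberately implausible entry
toy = pd.DataFrame({"mobile_wt": [120, 135, 140, 150, 155, 160, 175, 900]})

# Quartiles and interquartile range
q1, q3 = np.percentile(toy["mobile_wt"], [25, 75])
iqr = q3 - q1

# Fences: anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is treated as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows inside the fences
filtered = toy[toy["mobile_wt"].between(lower, upper)]
print(len(toy), "rows before,", len(filtered), "rows after")
```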




---
## Feature Engineering
You must now begin the process of extracting more information from existing data. You are not adding any new data here, but you are actually making the data you already have more useful.

---

The steps you should consider in this stage include:

- Developing new features apart from those already generated

- Selecting a set of features to remove

- Creating features using existing data through mathematical operations 

- Applying feature scaling

- Applying label encoding

- Understanding correlation between features and target
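
A few of these steps can be sketched on toy data. The derived feature names `px_total` and `sc_area` are hypothetical, and the values are made up; `StandardScaler` is one of several scaling options in scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only
toy = pd.DataFrame({
    "px_height": [480, 800, 1280, 1920],
    "px_width": [320, 480, 720, 1080],
    "sc_h": [10, 12, 14, 15],
    "sc_w": [5, 6, 7, 7],
    "ram": [512, 1024, 3072, 4096],
    "price_range": [0, 0, 1, 1],
})

engineered = toy.copy()

# Create features from existing data through mathematical operations
engineered["px_total"] = engineered["px_height"] * engineered["px_width"]  # total pixel count
engineered["sc_area"] = engineered["sc_h"] * engineered["sc_w"]            # screen area

# Feature scaling: zero mean and unit variance per column
feature_cols = [c for c in engineered.columns if c != "price_range"]
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(engineered[feature_cols]), columns=feature_cols)

# Correlation between each feature and the target
correlations = engineered[feature_cols].corrwith(engineered["price_range"])
print(correlations.sort_values(ascending=False))
```

In a real project the scaler should be fit on the training split only and then applied to the test split, so that no information from the test data leaks into training.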




In [None]:
# Code for feature engineering

In [None]:
# Code for feature engineering

---
## Building the Model

Now that the data has been processed, it is time to choose and build the model that will be used to make our predictions.

---

Consider the following points before choosing a model:

- Create train and test sets from the provided data
- The type of prediction this project requires (classification/regression)
- Determine the best features to use based on feature importance
- Define your model and modify its parameters
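
The points above can be sketched end to end on synthetic data. This is a minimal illustration under assumptions: `make_classification` stands in for the phone dataset, and the hyperparameter values (`n_estimators=100`, `max_depth=5`) are arbitrary starting points rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the phone data: 20 features, binary target
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)

# Hold out 30% for testing; stratify keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Price range is a category, so this is a classification problem
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Cross-validate on the training set only, then fit and score on the held-out test set
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
clf.fit(X_train, y_train)
test_score = clf.score(X_test, y_test)

print("Mean CV accuracy:", cv_scores.mean())
print("Test accuracy:", test_score)
```

Cross-validating on the training set lets you compare parameter settings without touching the test set, which should be used only once at the end.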


In [None]:
# For tree visualization: Kaggle doesn't support pydotplus, so install pydotplus from your system's conda terminal
'''
feature_names=['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi']
       
import pydotplus as pydot

from IPython.display import Image

# sklearn.externals.six was removed from scikit-learn; StringIO lives in the standard library
from io import StringIO
from sklearn import tree

dot_data = StringIO()

tree.export_graphviz(dtree, out_file=dot_data,feature_names=feature_names)

graph = pydot.graph_from_dot_data(dot_data.getvalue())

Image(graph.create_png())'''

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
# Name the model rfc so that the accuracy cells further down can reference it
rfc = RandomForestClassifier()

rfc.fit(X_train, y_train)

feature_weights = rfc.feature_importances_
feature_weights

# Show top 5 features 
indices = np.argsort(feature_weights)[::-1]
columns = X_train.columns.values[indices[:5]]
values = feature_weights[indices][:5]

# Create the plot
fig = plt.figure(figsize=(9, 5))
plt.title("Normalized Weights for First Five Most Predictive Features", fontsize=16)
plt.bar(np.arange(5), values, width=0.6, align="center", color='#00A000', \
       label="Feature Weight")
plt.bar(np.arange(5) - 0.3, np.cumsum(values), width=0.2, align="center", color='#00A0A0', \
       label="Cumulative Feature Weight")
plt.xticks(np.arange(5), columns)
plt.xlim((-0.5, 4.5))
plt.ylabel("Weight", fontsize=12)
plt.xlabel("Feature", fontsize=12)

plt.legend(loc='upper center')
plt.tight_layout()
plt.show()

In [None]:
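# Import train_test_split
from sklearn.model_selection import train_test_split
# Use the same split settings as the random forest cell so that X_test never contains rows a model was trained on
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)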

In [None]:
# Code to build the model
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
dtree.score(X_test,y_test)

---

## Accuracy Metrics
With the model finally completed, it is time to understand the model's performance.

---

Consider the following points when evaluating the model:

- Import the modules that will allow you to estimate different accuracy metrics
- Determine the number of positive and negative predictions.
- Make an assessment of what our results tell us and draw conclusions based on your findings 
- Provide and display your results using appropriate variables
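
To illustrate how the counts of positive and negative predictions relate to the usual metrics, here is a small worked sketch on made-up labels. It assumes label 1 (Premium - Flagship) is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted price ranges for ten phones
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted premium phones that are premium
recall = tp / (tp + fn)                     # share of premium phones the model found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

When the two classes are unevenly represented, accuracy alone can be misleading, which is why precision and recall are worth reporting alongside it.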

In [None]:
# Data Accuracy metrics code

In [None]:
rfc.score(X_test,y_test)

In [None]:
pred = rfc.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,pred))

In [None]:
matrix=confusion_matrix(y_test,pred)
print(matrix)

In [None]:
plt.figure(figsize = (10,7))
sns.heatmap(matrix,annot=True)

In [None]:
target.value_counts()


In [None]:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, balanced_accuracy_score
from sklearn.metrics import precision_score, recall_score

# Predict on the test set with the decision tree and find the accuracy of the predictions
y_test_pred = dtree.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(test_accuracy)


# Save the number of phones in the test set in a variable
phones_in_test = y_test.count()
print('The number of phones in our test dataset was', phones_in_test)
print('The number of phones whose price range our model predicted correctly was', test_accuracy*phones_in_test)


# Compute and print the confusion matrix
price_confusion_matrix = confusion_matrix(y_test, y_test_pred)
print(price_confusion_matrix)


# Visualize the confusion matrix
plt.figure(figsize = (5,3))
sns.heatmap(price_confusion_matrix, annot=True, fmt='d')

print(classification_report(y_test,y_test_pred))