<a href="https://colab.research.google.com/github/tirtha2016/Ml-Classification-_Mobile_Price_Range_Prediction/blob/main/Mobile_Price_Range_Prediction(Individual_copy).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - ***MOBILE PRICE RANGE PREDICTION :-***

##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1** - ***TIRTHA BOSE***

# **Project Summary -**

The mobile phone industry is fiercely competitive, and the price of a mobile phone is determined by multiple factors such as battery power, Bluetooth, camera quality, and screen size. To investigate the factors that influence the price range of mobile phones, a study was conducted. The study used a dataset of 21 variables (20 features plus the target) to predict the price range of mobile phones, which is categorized as low, medium, high, or very high.

 Initially, the analysis process focused on data wrangling, which involved managing missing values and verifying unique values. During this stage, it was discovered that 2 mobile phones had a pixel resolution height of 0, and 180 phones had a screen width of 0 cm. It is not logical for a phone screen width or pixel height to be 0, so I decided to replace these 0 values with the column mean values. This ensured that these columns no longer contained implausible zero values.

 After I finished data wrangling, I performed exploratory data analysis (EDA). From this analysis, I discovered that the four price ranges were equally represented in the dataset. Furthermore, I found that there was a positive correlation between the battery capacity of a phone and its price range. The distribution of battery capacity also gradually increased as the price range increased, implying that consumers may be willing to pay more for a mobile phone with a higher battery capacity. In terms of Bluetooth, I found that almost half of the devices had it, while the other half did not.

 From the scatter plot, it was evident that there was a positive correlation between RAM and price range. The majority of the data points were clustered towards the upper right corner, indicating that as the price range increased, so did the amount of RAM in the device. The study also discovered that the count of devices with dual sim was increasing for the highest price range. Furthermore, the distribution of primary camera megapixels across various target categories remained consistent, suggesting that this feature may not have a significant impact on the price range of mobile phones.

 Based on the analysis of screen size distribution among different target categories, it was observed that there was not a significant difference in the distribution. This suggests that screen size alone may not be the primary factor in determining target categories. However, this consistency in distribution can be beneficial for predictive modeling, as it indicates that screen size may not play a significant role in distinguishing between different target categories, enabling other features to have a more significant impact in determining the target categories. Additionally, the study revealed that mobile phones with higher price ranges were generally lighter in weight than those with lower price ranges.

 Following the exploratory data analysis (EDA), the study conducted hypothesis testing on three statements while handling outliers. During this process, the study identified that RAM, battery power, and pixel quality were the most significant factors influencing the price range of mobile phones. Afterward, the study engaged in feature engineering and utilized various machine learning models, such as

 1) Logistic regression,

 2) Random forest, and

 3) XGBoost.

 After conducting experiments, the study found that logistic regression and XGBoost algorithms with hyperparameter tuning delivered the most accurate results in predicting the price range of mobile phones.

In summary, the study discovered that the mobile phones in the dataset were separated into four distinct price ranges, each containing an equivalent number of elements. Roughly half of the devices in the dataset had Bluetooth, while the other half did not. Additionally, the study observed that as the price range increased, there was a gradual rise in battery power, and the amount of RAM in the device exhibited continuous growth from low-cost to very high-cost phones. Moreover, the study identified that expensive phones generally tended to be lighter than their lower-priced counterparts.

# **GitHub Link -**

https://github.com/tirtha2016/Ml-Classification-_Mobile_Price_Range_Prediction

# **Problem Statement**


**In the competitive mobile phone market, companies want to understand the sales data of mobile phones and the factors that drive prices. The objective is to find relationships between the features of a mobile phone (e.g., RAM, internal memory) and its selling price. In this problem, we do not have to predict the actual price, only a price range indicating how high the price is.**

Data Overview

* *Battery_power* - Total energy a battery can store at one time, measured in mAh

* *Blue* - Has Bluetooth or not

* *Clock_speed* - speed at which microprocessor executes instructions

* *Dual_sim* - Has dual sim support or not

* *Fc* - Front Camera mega pixels

* *Four_g* - Has 4G or not

* *Int_memory* - Internal Memory in Gigabytes

* *M_dep* - Mobile Depth in cm

* *Mobile_wt* - Weight of mobile phone

* *N_cores* - Number of cores of processor

* *Pc* - Primary Camera mega pixels

* *Px_height* - Pixel Resolution Height

* *Px_width* - Pixel Resolution Width

* *Ram* - Random Access Memory in Mega Bytes

* *Sc_h* - Screen Height of mobile in cm

* *Sc_w* - Screen Width of mobile in cm

* *Talk_time* - Longest time that a single battery charge will last when you are online

* *Three_g* - Has 3G or not

* *Touch_screen* - Has touch screen or not

* *Wifi* - Has wifi or not

* *Price_range* - This is the target variable with value of

 0(low cost),

 1(medium cost),

 2(high cost) and

 3(very high cost).

* Thus our target variable has 4 categories so basically it is a Multiclass classification problem.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import StackingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
# Mounting Google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Loading Mobile Price Range Dataset
mp_df = pd.read_csv('/content/drive/MyDrive/ML PROJECT/Ml  Classification Project/data_mobile_price_range.csv')

### Dataset First View

In [None]:
# Dataset First Look
# Checking the first 7 rows of the dataset
mp_df.head(7)

In [None]:
# Checking the last 7 rows of the dataset
mp_df.tail(7)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
mp_df.shape

### Dataset Information

In [None]:
# Dataset Info
mp_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values_count = len(mp_df[mp_df.duplicated()])

print("Number of duplicate values:", duplicate_values_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
mp_df.isnull().sum()

In [None]:
# Visualizing the missing values
msno.bar(mp_df,
         fontsize=10,
         figsize=(7,4),
         color='magenta')
plt.title('Missing values')
plt.show()

### What did you know about your dataset?

**From the above analysis, we learned the following about our dataset so far:**

*   The dataset consists of 2000 rows and 21 columns.

*  It has no null or empty values.

*  It has no duplicate rows.

*  It contains two data types: float and integer.
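The checks above (shape, duplicates, missing values, data types) can also be gathered in one place. The following is only a minimal sketch: `summarize_dataframe` and `toy_df` are hypothetical names introduced here for illustration, and the toy values stand in for `mp_df` rather than reproducing it.

```python
import pandas as pd


def summarize_dataframe(df: pd.DataFrame) -> dict:
    """Collect row/column counts, duplicate rows, missing values and dtypes."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isnull().sum().sum()),
        "dtypes": sorted(str(t) for t in df.dtypes.unique()),
    }


# Toy stand-in for mp_df (made-up values, not the real dataset)
toy_df = pd.DataFrame({
    "battery_power": [842, 1021, 563],
    "clock_speed": [2.2, 0.5, 0.5],
    "price_range": [1, 2, 2],
})

summary = summarize_dataframe(toy_df)
print(summary)
```

Calling `summarize_dataframe(mp_df)` in the notebook should report the same facts listed above, assuming the dataset has been loaded as `mp_df`.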

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
mp_df.columns

In [None]:
# Length of the columns
print(f'There are {len(mp_df.columns)} columns in this mobile price range dataset')

In [None]:
# Dataset Describe
# Summary statistics for every column, transposed for better visibility and analysis
# (only the last expression in a cell is displayed, so a single call is enough)
mp_df.describe().T

### Variables Description

1) Battery_power: Total energy a battery can store at one time, measured in mAh.

2) Blue: Has bluetooth or not.

3) Clock_speed: Speed at which microprocessor executes instructions.

4) Dual_sim: Has dual sim support or not.

5) Fc: Front Camera Mega Pixels.

6) Four_g: Has 4G or not.

7) Int_memory: Internal Memory in Gigabytes.

8) M_dep: Mobile Depth in cm.

9) Mobile_wt: Weight of mobile phone.

10) N_cores: Number of cores of processor.

11) Pc: Primary Camera Mega Pixels.

12) Px_height: Pixel Resolution Height.

13) Px_width: Pixel Resolution Width.

14) Ram: Random Access Memory in Megabytes.

15) Sc_h: Screen Height of mobile in cm.

16) Sc_w: Screen Width of mobile in cm.

17) Talk_time: Longest time that a single battery charge will last when you are online.

18) Three_g: Has 3G or not.

19) Touch_screen: Has touch screen or not.

20) Wifi: Has wifi or not.

21) Price_range: This is the target variable with value of 0(low cost), 1(medium cost), 2(High Cost), 3(Very High cost).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in mp_df.columns:
    unique_values = mp_df[column].unique()
    print(f"The Unique values for variable [{column}] are: {unique_values}")

In [None]:
# Checking the total number of Unique Values for each variable
mp_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# It is not logical for a phone screen width or pixel height to have a value of 0, so we need to make sure to verify and address such instances to prevent any complications in our analysis
# Count of phones with sc_w = 0
sc_w_zero_count = sum(mp_df.sc_w == 0)
print(f"Number of phones with sc_w = 0: {sc_w_zero_count}")

# Count of phones with px_height = 0
px_height_zero_count = sum(mp_df.px_height == 0)
print(f"Number of phones with px_height = 0: {px_height_zero_count}")

In [None]:
# Replacing 0 values with the mean value
sc_w_mean = mp_df.sc_w.mean()
px_height_mean = mp_df.px_height.mean()

mp_df.sc_w = np.where(mp_df.sc_w == 0, sc_w_mean, mp_df.sc_w)
mp_df.px_height = np.where(mp_df.px_height == 0, px_height_mean, mp_df.px_height)

# Printing the updated dataframe
print(mp_df)

In [None]:
# Checking for the 0 values in the sc_w and px_height columns after the data wrangling

# Count of phones with sc_w = 0
sc_w_zero_count = sum(mp_df.sc_w == 0)
print(f"Number of phones with sc_w = 0: {sc_w_zero_count}")

# Count of phones with px_height = 0
px_height_zero_count = sum(mp_df.px_height == 0)
print(f"Number of phones with px_height = 0: {px_height_zero_count}")

## Duplicate Values

In [None]:
# Checking whether there are duplicates or not
print(f'There are {len(mp_df[mp_df.duplicated()])} duplicate values in the mobile price range data set')

## Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
mp_df.isnull().sum()

### What all manipulations have you done and insights you found?

## We observed the following insights:

i) I discovered that there are 2 phones in the dataset with a pixel height value of 0, and 180 phones with a screen width value of 0.


ii) It is illogical for a phone screen width or pixel height to have a value of 0, so it is necessary to identify and address these instances properly to prevent any potential problems in our analysis.


iii) The 0 values in these two columns have been replaced with their respective column mean values, so the table no longer contains implausible zeros for screen width or pixel height. Therefore, our data is now prepared for data analysis.
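One detail worth noting: the wrangling code computes each column mean over all rows, so the zeros themselves pull the replacement value down slightly. A possible variant, shown here only as a sketch, computes the mean from the non-zero entries. The helper name `replace_zeros_with_nonzero_mean` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def replace_zeros_with_nonzero_mean(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df where zeros in `columns` become the mean of the non-zero values."""
    cleaned = df.copy()
    for col in columns:
        values = cleaned[col].astype(float)
        # Mean over valid (non-zero) measurements only
        nonzero_mean = values[values != 0].mean()
        cleaned[col] = values.mask(values == 0, nonzero_mean)
    return cleaned


# Toy stand-in for mp_df (made-up values)
toy_df = pd.DataFrame({
    "sc_w": [0, 4, 6, 2],
    "px_height": [100, 0, 300, 500],
})

cleaned_df = replace_zeros_with_nonzero_mean(toy_df, ["sc_w", "px_height"])
print(cleaned_df)
```

The function works on a copy, so the original frame stays available for comparing the two imputation choices.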

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **UNIVARIATE ANALYSIS:-**

#### Chart - 1

## What is the distribution of battery power of different mobile phones?

In [None]:
# Chart - 1 visualization code
# displot is a figure-level function that creates its own figure,
# so the size is set through `height` instead of a separate plt.figure() call
sns.displot(mp_df["battery_power"], color='blue', edgecolor='black', linewidth=1,
            bins = 20, height = 7)
plt.xlabel('Battery Power')
plt.ylabel('Frequency')
plt.title('Distribution of Battery Power')
plt.show()

##### 1. Why did you pick the specific chart?

I used seaborn's "displot" here because it draws a histogram of a single variable, which makes it well suited to showing the univariate distribution of battery power.

##### 2. What is/are the insight(s) found from the chart?

The plot illustrates the distribution of battery capacity in the dataset, measured in milliampere-hours (mAh). The distribution is almost uniform, with a slightly higher frequency in the lower battery power range. This means that phones with lower battery capacity are slightly more common in the dataset; the dataset does not contain sales figures, so it says nothing directly about how often such phones are sold.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The graph shows a slight skew towards phones with lower battery capacity, which suggests that such models are somewhat more common among the phones in the dataset. If a mobile phone manufacturer is able to create phones with higher battery capacity that are competitively priced, they may be able to attract more customers and generate more revenue. This information could also guide marketing and advertising strategies, as companies can promote battery capacity as a key selling point to potential customers.

#### Chart - 2

## What is the percentage of different classes of mobile price range?

In [None]:
# Chart - 2 visualization code
# Classes of Mobile Price Range
price_counts = mp_df['price_range'].value_counts()
plt.pie(price_counts, labels = price_counts.index, autopct='%1.1f%%', shadow=True, startangle=180, explode=(0.05,0.05,0.05,0.05),
       wedgeprops={"edgecolor":"0",'linewidth': 1,'linestyle': 'solid', 'antialiased': True})
plt.title('Price Range Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

I used a pie chart here because it shows percentages of a whole at a single point in time. Unlike bar graphs and line graphs, pie charts do not show changes over time.

##### 2. What is/are the insight(s) found from the chart?

The four price range categories each account for an equal share of the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

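From the insight above, every price category is equally represented in the dataset. This may hint that demand for each category is similar, although the dataset alone cannot confirm that. The balanced classes are also convenient for modeling, since no class dominates the training data.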
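The class balance read from the pie chart can also be checked numerically. The snippet below is a small sketch on made-up labels; `class_shares` and `toy_labels` are hypothetical names, and in the notebook the same function could be applied to `mp_df['price_range']`.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.Series:
    """Return the percentage share of each class, ordered by class label."""
    return (labels.value_counts(normalize=True).sort_index() * 100).round(1)


# Toy stand-in for mp_df['price_range'] (made-up, perfectly balanced labels)
toy_labels = pd.Series([0, 1, 2, 3, 0, 1, 2, 3], name="price_range")

shares = class_shares(toy_labels)
print(shares)
```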

#### Chart - 3

## Is Bluetooth available or not?

In [None]:
# Chart - 3 visualization code
fig = plt.figure(1, figsize=(8,8))
# Counts are ordered as [blue == 0, blue == 1], so the labels must follow the same order
blue_data = [(len(mp_df[mp_df.blue==0])),(len(mp_df[mp_df.blue==1]))]
blue_keys=["Bluetooth_Not_Available","Bluetooth_Available"]
explode = [0, 0.1]
palette_color =sns.color_palette('rocket_r')
plt.pie(blue_data, labels=blue_keys, colors=palette_color,explode=explode, autopct='%.0f%%',textprops={'fontsize': 12})
plt.title('Bluetooth Available OR Not Available')
plt.show()

##### 1. Why did you pick the specific chart?

I used a pie chart here because it shows the share of phones with and without Bluetooth as percentages of the whole.


##### 2. What is/are the insight(s) found from the chart?

So we can see half the devices have Bluetooth, and half don’t.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The Bluetooth feature is distributed almost identically across all price ranges, so it may not be helpful in making predictions.

#### Chart - 4  BIVARIATE ANALYSIS

## 3G And 4G Connectivity

In [None]:
# Chart - 4 visualization code
# Note: palette_color is defined in the Chart - 3 cell
binary_features = [ 'four_g', 'three_g']
for feature in binary_features:
  fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize = (8 ,8))

  # Pie chart: overall share of phones with (1) and without (0) the feature;
  # reindex fixes the order so the legend labels always match the wedges
  mp_df[feature].value_counts().reindex([1, 0]).plot.pie (autopct='%1.1f%%', ax = ax1,colors=palette_color, shadow=True,labeldistance=None)
  ax1.set_title(f'Overall share of {feature} support')
  ax1.legend(['Support', 'Does not Support'])

  # Count plot: feature support split by price range
  sns.countplot(x = feature, hue = 'price_range', data = mp_df, ax = ax2, color = 'red')
  ax2.set_title('Distribution by price range')
  ax2.set_xlabel(feature)
  ax2.legend(['Low Cost', 'Medium Cost', 'High Cost', 'Very High Cost'])
  ax2.set_xticklabels(['Does not Support', 'Support'])
  plt.show()


##### 1. Why did you pick the specific chart?

I used a pie chart and a bar graph here to check 3G and 4G connectivity across the phones.

##### 2. What is/are the insight(s) found from the chart?

For 4G, the distribution of price ranges is almost the same for phones that support the feature and phones that do not, so it is unlikely to be useful for prediction.
The 'three_g' feature appears to play a more important role in price prediction.
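A reading like this can be cross-checked with a row-normalised contingency table, which shows the share of each price range among phones with and without a feature. The following is a minimal sketch on made-up data; `share_by_price_range` and `toy_df` are hypothetical names, and the values do not come from the real dataset.

```python
import pandas as pd


def share_by_price_range(df: pd.DataFrame, feature: str, target: str = "price_range") -> pd.DataFrame:
    """For each value of `feature`, return the share of rows in each target class."""
    return pd.crosstab(df[feature], df[target], normalize="index")


# Toy stand-in for mp_df (made-up values)
toy_df = pd.DataFrame({
    "four_g":      [0, 0, 0, 0, 1, 1, 1, 1],
    "price_range": [0, 1, 2, 3, 0, 1, 2, 3],
})

table = share_by_price_range(toy_df, "four_g")
print(table)
```

If the two rows of the table look nearly identical, the feature carries little information about the price range; clearly different rows point the other way.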

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it will help us create a positive business impact.

#### Chart - 5

## Relationship between RAM and price range

In [None]:
# Chart - 5 visualization code
# matplotlib.pyplot is already imported as plt at the top of the notebook
import matplotlib.colors as mcolors

# Defining the colors for each price range
colors = ['cyan', 'magenta', 'yellow', 'black']

# Creating a colormap using the colors
cmap = mcolors.ListedColormap(colors)

# Creating the scatter plot
plt.scatter(mp_df['price_range'], mp_df['ram'], c = mp_df['price_range'], cmap = cmap)
plt.xlabel('Price Range')
plt.ylabel('RAM')
plt.xticks([0, 1, 2, 3])
plt.show()

##### 1. Why did you pick the specific chart?

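A scatter plot is commonly used to visualize the relationship between two variables. It is particularly useful for understanding the distribution and patterns of data points and identifying any potential correlations or trends. Here the x-axis is the ordinal price range (0 to 3), so the points form four vertical strips, which still makes the shift in RAM across price ranges easy to see.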

##### 2. What is/are the insight(s) found from the chart?

The scatter plot reveals a noticeable positive correlation between RAM and price range, as most of the data points gather towards the upper right corner. This implies that as the price range rises, there is a tendency for the device's RAM to also increase.
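The visual impression can be backed by two simple numbers: the median RAM per price range and a rank correlation between RAM and price range. The sketch below uses made-up values in a hypothetical `toy_df`. Spearman's coefficient is computed as the Pearson correlation of the ranks, which is its definition and needs only pandas.

```python
import pandas as pd

# Toy stand-in for mp_df (made-up values)
toy_df = pd.DataFrame({
    "ram":         [500, 800, 1500, 1800, 2500, 2900, 3500, 3900],
    "price_range": [0,   0,   1,    1,    2,    2,    3,    3],
})

# Median RAM inside each price range
median_ram = toy_df.groupby("price_range")["ram"].median()

# Spearman rank correlation = Pearson correlation of the ranked values
rank_corr = toy_df["ram"].rank().corr(toy_df["price_range"].rank())

print(median_ram)
print(f"Spearman rank correlation: {rank_corr:.3f}")
```

A rank correlation suits this case because `price_range` is an ordered category rather than a continuous measurement.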



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The observations derived from the scatter plot, such as the positive correlation between RAM and price range, hold significance for businesses. This information can be utilized by companies to strategize their product development and marketing efforts. For instance, they can leverage this insight to create and promote smartphones with higher RAM capacities, catering to customers who are willing to invest more, which may result in augmented revenue and profitability.

#### Chart - 6

## Relationship between pixel width / pixel height and price range

In [None]:
# Chart - 6 visualization code
# Setting up the figure and axes
fig, axs = plt.subplots(1, 2, figsize = (15, 5))

# Creating a kernel density estimate plot for the pixel width distribution for each price range
sns.kdeplot(data = mp_df, x = 'px_width', hue = 'price_range', fill = True, common_norm = False, palette = 'coolwarm', ax = axs[0])
axs[0].set_xlabel('Pixel Width')
axs[0].set_ylabel('Density')
axs[0].set_title('Pixel Width Distribution by Price Range')

# Creating a box plot of pixel width for each price range
sns.boxplot(data = mp_df, x = 'price_range', y = 'px_width', palette = 'coolwarm', ax = axs[1])
axs[1].set_xlabel('Price Range')
axs[1].set_ylabel('Pixel Width')
axs[1].set_title('Pixel Width by Price Range')

# Adjusting the layout and spacing
plt.tight_layout()

# Plotting the graph
plt.show()


## Pixel_height

In [None]:
# Setting up the figure and axes
fig, axs = plt.subplots(1, 2, figsize = (15, 5))

# Creating a kernel density estimate plot for the pixel height distribution for each price range
sns.kdeplot(data = mp_df, x = 'px_height', hue = 'price_range', fill = True, common_norm = False, palette = 'coolwarm', ax = axs[0])
axs[0].set_xlabel('Pixel Height')
axs[0].set_ylabel('Density')
axs[0].set_title('Pixel Height Distribution by Price Range')

# Creating a box plot of pixel height for each price range
sns.boxplot(data = mp_df, x = 'price_range', y = 'px_height', palette = 'coolwarm', ax = axs[1])
axs[1].set_xlabel('Price Range')
axs[1].set_ylabel('Pixel Height')
axs[1].set_title('Pixel Height by Price Range')

# Adjusting the layout and spacing
plt.tight_layout()

# Plotting the graph
plt.show()


##### 1. Why did you pick the specific chart?

A KDE plot is used to estimate the probability density function of a continuous variable, in this case the pixel width and the pixel height. It provides a smooth curve that represents the distribution of each variable for each price range.

A box plot summarizes the distribution of a numerical variable, showcasing key statistics such as the median, quartiles, and any outliers present.
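The points a box plot draws as outliers follow a simple rule: with the default whisker length of 1.5, a value is flagged when it lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below applies that rule directly; `iqr_outlier_mask` and `toy_widths` are hypothetical names and the values are made up.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy stand-in for a column such as mp_df['px_width'] (made-up values)
toy_widths = pd.Series([500, 800, 1000, 1200, 1500, 9000])

mask = iqr_outlier_mask(toy_widths)
print(toy_widths[mask])
```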

##### 2. What is/are the insight(s) found from the chart?

The analysis of the pixel width distribution across different price ranges reveals that the relationship between pixel width and cost is not a linear progression. Specifically, mobile phones in the medium and high price ranges exhibit similar pixel widths, suggesting that pixel width alone may not be the sole determining factor in pricing mobile phones. Other factors, such as processor performance, camera quality, storage capacity, and brand reputation, likely influence the price range. Therefore, taking a comprehensive approach that considers multiple features is necessary to accurately determine the pricing and positioning of mobile phones in the market. Similarly, there is only minor variation in pixel height as we move from low-cost to high-cost devices, further supporting the notion that factors beyond pixel dimensions contribute to price differentiation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The analysis of pixel height distribution across various price ranges offers valuable insights that can have a positive impact on businesses, particularly mobile phone manufacturers and marketers. These insights provide valuable information that manufacturers can use to enhance their product design and pricing strategies, aligning them with market demands and ultimately boosting sales. Similarly, marketers can leverage this knowledge to create targeted advertising campaigns and promotions that cater to the specific preferences of different consumer segments. By adapting their approaches based on the relationship between pixel height and price range, businesses can optimize their operations and achieve favorable outcomes in the competitive mobile phone market.

However, the limited variation in pixel height as we move across different price ranges can present a challenge for manufacturers and marketers. Since pixel height may not play a significant role in determining the price range of mobile phones, it becomes crucial for manufacturers and marketers to emphasize other distinguishing features such as processor performance, camera quality, storage capacity, and brand value. Focusing solely on pixel height to determine pricing could lead to stagnant growth and a lack of differentiation in a highly competitive market. Therefore, a comprehensive approach that considers multiple factors is necessary for accurate pricing and effective positioning of mobile phones, ensuring they meet the preferences and expectations of the target market.

#### Chart - 7

## Relationship between Wifi and price range

In [None]:
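# Chart - 7 visualization code
# NOTE: the price bounds and WiFi flags below are hard-coded placeholders; they are not computed from mp_df
# Defining the four price ranges (illustrative bounds, not used by the chart)
price_ranges = {
    'low': (0, 50),
    'medium': (51, 100),
    'high': (101, 200),
    'premium': (201, float('inf'))
}

# Simulating the availability of WiFi for each price range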
wifi_availabilities = {
    'low': True,
    'medium': True,
    'high': False,
    'premium': True
}

# Counting the number of price ranges with WiFi available or not
wifi_counts = {
    'available': sum(wifi_availabilities.values()),
    'unavailable': len(wifi_availabilities) - sum(wifi_availabilities.values())
}

# Visualizing the result as a pie chart
labels = ['WiFi available', 'WiFi unavailable']
sizes = list(wifi_counts.values())
colors = ['green', 'yellow']

fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90, explode=(0.05,0.05), wedgeprops={"edgecolor":"0",'linewidth': 1,'linestyle': 'solid', 'antialiased': True})
ax.axis('equal')
plt.title('WiFi availability by price range')
plt.show()


##### 1. Why did you pick the specific chart?

The pie chart allows for a clear visualization of the distribution of WiFi availability by price range, making it suitable for conveying this particular type of data and comparison.

##### 2. What is/are the insight(s) found from the chart?

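In the simulated flags used for this chart, one of the four price ranges (25%) has WiFi unavailable and three (75%) have WiFi available. Because these flags are hard-coded rather than computed from the dataset, the chart does not describe the actual WiFi availability of the phones in `mp_df`.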

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights derived from the visualization can have a positive impact on business by providing valuable information regarding WiFi availability in different price ranges. This information can guide companies in making informed decisions to enhance their competitiveness. For instance, if the analysis reveals that WiFi is lacking in a particular price range, the company can prioritize incorporating WiFi into their devices within that range to meet customer expectations and improve market positioning.

However, if the analysis indicates that WiFi is unavailable in the majority of price ranges, it could potentially result in negative growth. Customers may consider WiFi as an essential feature and opt for competitors' devices that offer WiFi connectivity. Hence, it is crucial to carefully consider market demand and customer preferences before making business decisions based on the insights obtained from the visualization.
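Since the pie chart in this section is built from simulated flags, a data-driven view of the same question would tabulate the `wifi` column against `price_range`. The sketch below shows one possible way to do this on made-up data; `wifi_share_by_price_range` and `toy_df` are hypothetical names, and no result for the real dataset is implied.

```python
import pandas as pd


def wifi_share_by_price_range(df: pd.DataFrame) -> pd.DataFrame:
    """Percentage of phones with and without WiFi inside each price range."""
    table = pd.crosstab(df["price_range"], df["wifi"], normalize="index") * 100
    return table.rename(columns={0: "WiFi unavailable", 1: "WiFi available"})


# Toy stand-in for mp_df (made-up values)
toy_df = pd.DataFrame({
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
    "wifi":        [1, 0, 1, 1, 0, 0, 1, 0],
})

wifi_table = wifi_share_by_price_range(toy_df)
print(wifi_table)
```

Applied to `mp_df`, each row of the table gives the split for one price range, which could then be drawn as a grouped bar chart instead of a single pie.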

#### Chart - 8

## Relationship between mobile weight and price range

In [None]:
# Chart - 8 visualization code
# Creating the figure and axes
fig, axs = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Kernel density estimation plot
sns.kdeplot(data=mp_df, x='mobile_wt', hue='price_range', ax=axs[0])
axs[0].set_title('Distribution of Mobile Weight by Price Range')
axs[0].set(xlabel='Mobile Weight', ylabel='Density')

# Plot 2: Box plot
sns.boxplot(data=mp_df, x='price_range', y='mobile_wt', ax=axs[1])
axs[1].set_title('Mobile Weight Box Plot by Price Range')
axs[1].set(xlabel='Price Range', ylabel='Mobile Weight')

# Adjusting the spacing between subplots
plt.tight_layout()

# Showing the plot
plt.show()

##### 1. Why did you pick the specific chart?

By including both the KDE plot and the box plot side by side, we can gain a comprehensive understanding of the relationship between mobile weight and price range. The KDE plot offers a smooth representation of the overall distribution, while the box plot provides a concise summary and highlights any variations or outliers within each price range. Together, these visualizations provide insights into the distribution and characteristics of mobile weight across different price ranges, aiding in analyzing the relationship between the two variables.

##### 2. What is/are the insight(s) found from the chart?

An observation can be made that mobile phones with higher price ranges generally exhibit a lighter weight in comparison to mobile phones with lower price ranges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the analysis can have a positive impact on business by guiding product positioning and pricing strategies. By identifying the features that strongly influence the price range of mobile phones, businesses can prioritize and emphasize those aspects in their product design and marketing efforts. For instance, in the given observation where higher-priced phones tend to be lighter, a company can focus on lightweight designs for their high-end models.

However, it is important to note that relying excessively on a single feature to determine pricing may have limitations and potentially hinder growth. By solely focusing on one aspect, businesses may overlook the diverse preferences of customers and fail to address other important factors like brand value or customer service. To ensure sustainable growth and competitiveness, it is crucial to consider multiple factors and strike a balance in decision-making, incorporating a holistic approach that considers various aspects of the product and customer experience.

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Checking for multi-collinearity
# Calculating the correlation matrix
correlation = mp_df.corr()

# Creating a heatmap of the correlation matrix
plt.figure(figsize=[20, 15])
sns.heatmap(correlation, cmap='viridis', annot=True, annot_kws={'fontsize': 10})
plt.title('Correlation Heatmap',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

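I chose a correlation heatmap to assess the presence of multicollinearity among the features and to see which features correlate with price_range.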

##### 2. What is/are the insight(s) found from the chart?

The strong correlation between RAM and price_range is a positive indication for businesses, as it suggests that RAM plays a significant role in determining the price range of mobile phones.

However, there are instances of collinearity present in the data. Specifically, there is a correlation between the feature pairs ('pc', 'fc') and ('px_width', 'px_height'). These correlations are logical since a phone with a high-quality front camera is likely to have a high-quality primary camera, and an increase in pixel height generally corresponds to an increase in pixel width.
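Correlated pairs like these can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch on synthetic data; the helper name `correlated_pairs`, the 0.5 threshold, and the toy DataFrame are assumptions for illustration, and in the notebook the function would be applied to `mp_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the notebook this would be mp_df.
rng = np.random.default_rng(0)
px_height = rng.integers(100, 2000, size=200)
toy = pd.DataFrame({
    "px_height": px_height,
    "px_width": px_height + rng.integers(0, 300, size=200),  # built to correlate
    "ram": rng.integers(256, 4000, size=200),                # independent column
})

def correlated_pairs(df, threshold=0.5):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

pairs = correlated_pairs(toy, threshold=0.5)
print(pairs)
```

The threshold is a judgment call; a lower value flags more pairs for review.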

To address this collinearity, one possible approach is to consider replacing the 'px_height' and 'px_width' features with a single feature representing the total number of pixels in the screen. However, it is essential to retain the separate 'fc' and 'pc' features, as they represent distinct aspects of the camera capabilities (front camera megapixels vs. primary camera megapixels) of the phone.
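As a rough illustration of the combined-pixel idea, the sketch below multiplies the two columns into one feature. The column name `px_total` and the sample rows are hypothetical; only `px_height` and `px_width` come from the dataset.

```python
import pandas as pd

# Hypothetical rows; the column names match those used in the notebook.
toy = pd.DataFrame({
    "px_height": [20, 905, 1263],
    "px_width": [756, 1988, 1716],
})

# One combined feature: total number of pixels on the screen.
toy["px_total"] = toy["px_height"] * toy["px_width"]

# The two original columns can then be dropped.
toy = toy.drop(columns=["px_height", "px_width"])
print(toy)
```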

#### Chart - 10 - Price Range Vs All Numerical Factor

In [None]:
# Chart - 10 visualization code
# Count plots of each feature, split by price_range through the hue argument
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Price Range vs all numerical factor')
sns.countplot(ax=axes[0, 0], data=mp_df, x='three_g', hue='price_range', palette='dark:y')
sns.countplot(ax=axes[0, 1], data=mp_df, x='touch_screen', hue='price_range', palette='dark:salmon')
sns.countplot(ax=axes[0, 2], data=mp_df, x='four_g', hue='price_range', palette='dark:b')
sns.countplot(ax=axes[1, 0], data=mp_df, x='wifi', hue='price_range', palette='dark:g')
sns.countplot(ax=axes[1, 1], data=mp_df, x='fc', hue='price_range', palette='dark:y_r')
sns.countplot(ax=axes[1, 2], data=mp_df, x='dual_sim', hue='price_range', palette='dark:r')
plt.show()

##### 1. Why did you pick the specific chart?

I selected count plots for this analysis because they visually depict the distribution of categorical variables with respect to the price_range. Applying a distinct color palette to each plot makes it easy to compare variable distributions across price ranges and uncover potential connections.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates how different categorical features, such as connectivity options (three_g, four_g, wifi), touch screen availability (touch_screen), front camera megapixels (fc), and dual-SIM support (dual_sim), are distributed across various price ranges. This aids in identifying potential associations between these features and price segmentation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can inform product strategies for positive business impact. However, a lack of popular features like four_g in lower-priced phones might hinder competitiveness and result in negative growth due to changing customer expectations.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1  All category phones are distributed with equal price range.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (Ho): Phones are distributed equally across the four price ranges.

Alternative hypothesis (Ha): Phones are not distributed equally across the four price ranges.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy import stats

# Calculating observed frequency distribution
observed_freq = mp_df['price_range'].value_counts().values

# Calculating expected frequency distribution (equal share per price range)
total = len(mp_df)
n_classes = len(observed_freq)
expected_freq = [total / n_classes] * n_classes

# Performing chi-square goodness-of-fit test
chi2, p = stats.chisquare(observed_freq, f_exp=expected_freq)

# Printing results
print(f'Chi-square statistic: {chi2}, p-value: {p}')

##### Which statistical test have you done to obtain P-Value?

In the hypothesis testing example where we tested the statement "All category phones are distributed with equal price range", we used the Chi-square goodness-of-fit test to obtain the p-value. The Chi-square goodness-of-fit test is a statistical test used to determine whether an observed frequency distribution fits a theoretical distribution. It is used to test the null hypothesis that the observed distribution is no different than the expected distribution. The p-value obtained from the Chi-square goodness-of-fit test indicates the probability of observing a test statistic as extreme as the one obtained from the sample, assuming the null hypothesis is true. A p-value less than the significance level (usually 0.05) indicates that we reject the null hypothesis and conclude that the observed distribution is significantly different than the expected distribution. A p-value greater than or equal to the significance level indicates that we fail to reject the null hypothesis and conclude that the observed distribution is not significantly different than the expected distribution.

##### Why did you choose the specific statistical test?

I used the Chi-square goodness-of-fit test in the hypothesis testing example to compare the observed frequency distribution with the expected distribution under the null hypothesis. The null hypothesis assumed that all categories of phones have an equal price range distribution. By calculating the expected frequency distribution based on this assumption, I was able to compare it with the observed frequency distribution obtained from the data. The Chi-square test statistic quantified the difference between the expected and observed distributions, and the resulting p-value represented the likelihood of obtaining a test statistic as extreme as the one observed, assuming the null hypothesis is true. If the p-value was less than the chosen significance level (typically 0.05), it indicated significant evidence against the null hypothesis, suggesting a notable difference between the observed and expected distributions. On the other hand, if the p-value was greater than or equal to the significance level, it implied that there was insufficient evidence to reject the null hypothesis, indicating no significant difference between the observed and expected distributions. Therefore, the Chi-square goodness-of-fit test was a suitable statistical test for this particular scenario.
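To make the statistic concrete, here is a small sketch that computes the goodness-of-fit statistic by hand as the sum of (observed - expected)^2 / expected and compares it with `scipy.stats.chisquare`. The observed counts are made up for illustration and are not the notebook's data.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for four price ranges (not the notebook's data).
observed = np.array([48, 52, 55, 45])
expected = np.full(4, observed.sum() / 4)  # equal split under the null hypothesis

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# p-value from the chi-square distribution with k - 1 degrees of freedom.
dof = len(observed) - 1
p_manual = stats.chi2.sf(chi2_manual, dof)

# The library call should agree with the manual calculation.
chi2_scipy, p_scipy = stats.chisquare(observed, f_exp=expected)
print(chi2_manual, p_manual)
```

With perfectly balanced classes the observed and expected counts coincide, the statistic is 0, and the p-value is 1.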

### Hypothetical Statement - 2

## Wifi is unavailable in approximately 25% of the devices and available in approximately 75%.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (Ho)**: The proportion of devices with wifi available is equal to 0.75 (equivalently, the proportion without wifi is 0.25).

**Alternative Hypothesis (Ha)**: The proportion of devices with wifi available is not equal to 0.75.

#### 2. Perform an appropriate statistical test.

In [None]:
import scipy.stats as stats

# Defining the null hypothesis proportion
null_prop = 0.75

# Defining the sample size
n = 100

# Calculating the probability of observing k devices with wifi availability
k = range(0, n+1)
null_probabilities = [stats.binom.pmf(x, n, null_prop) for x in k]

# Printing the probability of observing exactly k devices with wifi availability
for k_val, probability in zip(k, null_probabilities):
    print(f"k = {k_val}, probability = {probability}")

In [None]:
# Perform Statistical Test to obtain P-Value
import statsmodels.stats.proportion as smprop

# Defining the null and alternative hypotheses
null_hypothesis = "The proportion of devices with wifi availability is equal to 0.75."
alternative_hypothesis = "The proportion of devices with wifi availability is not equal to 0.75."

# Setting the significance level
alpha = 0.05

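# Taking the sample size and the number of devices with wifi from the data
# (assumes the wifi column is coded as 0 = not available, 1 = available)
n = len(mp_df)
num_with_wifi = int(mp_df['wifi'].sum())

# Performing the two-sided one-sample proportion z-test against 0.75
test_result = smprop.proportions_ztest(num_with_wifi, n, value=0.75)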

# Extracting the test statistic and p-value
test_stat, p_value = test_result

# Printing the results
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

print("Test statistic:", test_stat)
print("p-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the p-value is the one-sample proportion test. This test is employed when comparing a sample proportion to a known population proportion, with the aim of determining if the difference between the two proportions is statistically significant.

In the given scenario, we utilized the one-sample proportion test to compare the proportion of devices with wifi availability in the sample to a hypothesized population proportion of 0.75. The resulting p-value is the probability of observing a sample proportion at least as far from 0.75 as the one observed, under the assumption that the population proportion is 0.75. If the obtained p-value falls below a predetermined significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is a statistically significant difference between the sample proportion and the hypothesized proportion. Conversely, if the p-value is at or above the significance level, we fail to reject the null hypothesis, indicating insufficient evidence of a statistically significant difference.

##### Why did you choose the specific statistical test?

I selected the one-sample proportion test because the research question specifically pertained to the proportion of devices with wifi availability in a population. The one-sample proportion test is designed precisely for comparing a sample proportion to a known population proportion and determining the statistical significance of the difference between them.

In this particular situation, we had a hypothesized population proportion of 0.75 (the proportion of devices with wifi availability in the population) and the sample proportion of devices with wifi availability that was passed to the test. By employing the one-sample proportion test, we were able to assess the statistical significance of the difference between these two proportions and decide whether to reject the null hypothesis.

Hence, the one-sample proportion test was an appropriate choice for this analysis, as it allowed us to investigate the research hypothesis and address the research question based on the available data.
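For reference, the z-statistic behind a one-sample proportion test can be sketched by hand. The function name and the counts below are hypothetical. This version uses the null proportion in the standard error; `proportions_ztest` in statsmodels uses the sample proportion there unless `prop_var` is set, so the two can differ slightly.

```python
from math import sqrt
from scipy.stats import norm

def one_sample_proportion_ztest(successes, n, p0):
    """Two-sided z-test of H0: p == p0, using the null variance p0 * (1 - p0) / n."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Hypothetical counts, not taken from the dataset.
z, p_value = one_sample_proportion_ztest(successes=60, n=100, p0=0.75)
print(z, p_value)
```

In the notebook, `successes` and `n` would come from the wifi column of `mp_df` rather than from fixed numbers.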

### Hypothetical Statement - 3

## The proportion of devices with 3G is approximately the same across all price ranges.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis (Ho)**: The proportion of devices with 3G support is the same across all price ranges.

**Alternative hypothesis (Ha)**: The proportion of devices with 3G support differs for at least one pair of price ranges.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
import scipy.stats as stats

# Constructing the contingency table
contingency_table = pd.crosstab(mp_df['price_range'], mp_df['three_g'])

# Performing the chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Printing the contingency table, chi-square statistic, and p-value
print("Contingency Table:\n", contingency_table)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

I used the chi-square test of independence to obtain the p-value in this analysis. The chi-square test of independence examines the relationship between two categorical variables. In this case, the variables were the price range and the presence of 3G support (three_g) in the devices. The test calculates a chi-square statistic, which quantifies the difference between the observed frequencies and the frequencies expected if there were no association between the variables (the null hypothesis).

The p-value is the probability of observing a chi-square statistic at least as extreme as the one derived from the sample, assuming that the null hypothesis is true. When the p-value is small (typically below 0.05), we reject the null hypothesis and conclude that there is evidence of a significant association between the variables. Conversely, when the p-value is large (typically above 0.05), we fail to reject the null hypothesis and conclude that there is insufficient evidence of a significant association between the variables.

In summary, the chi-square test of independence was used to assess the association between the price range and 3G support, and the resulting p-value guided the decision on whether a significant association exists between these variables.

##### Why did you choose the specific statistical test?

The chi-square test compares the observed frequencies in a contingency table with the expected frequencies assuming no association between the variables. If the calculated chi-square statistic is sufficiently large and the resulting p-value is below a predetermined significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant association between the variables.

In this instance, the chi-square test yielded a p-value of 0.7116958581372179, which exceeds the conventional significance level of 0.05. Consequently, we do not reject the null hypothesis, indicating that there is insufficient evidence to support a significant association between the variables price_range and three_g.
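The expected frequencies used by the test follow directly from the row and column totals (row total × column total / grand total). The sketch below checks this against `scipy.stats.chi2_contingency` on a made-up 4 x 2 table; the counts are illustrative only and do not reproduce the p-value reported above.

```python
import numpy as np
from scipy import stats

# Hypothetical 4 x 2 contingency table:
# rows are price ranges 0-3, columns are three_g = 0 and three_g = 1.
table = np.array([
    [30, 95],
    [28, 97],
    [32, 93],
    [26, 99],
])

# Expected count per cell under independence: row total * column total / grand total.
row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
expected_manual = row_totals @ col_totals / table.sum()

chi2_manual = ((table - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (table.shape[0] - 1) * (table.shape[1] - 1)

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2_manual, dof_manual, p_value)
```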

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
mp_df.isnull().sum()

The output above shows that the dataset has no null or missing values.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Setting the figure size to 20x20
plt.figure(figsize=(20, 20))

# Looping through each numeric column reported by describe()
for index, item in enumerate(mp_df.describe().columns.to_list()):

  # Placing the plot in a 5x5 grid (subplot positions are 1-based)
  plt.subplot(5, 5, index + 1)

  # Creating a box plot of the current column's data
  sns.boxplot(x=mp_df[item])

  # Adding the column name to the subplot title
  plt.title(item)

# Adding some spacing between the subplots
plt.subplots_adjust(hspace=0.5)

# Displaying the figure
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

Since the box plots show very few outliers, no outlier treatment was applied.

### 3. Categorical Encoding

Categorical encoding is not required as all the values are already in either integer or float format.

### 4. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

I have decided to remove the variables px_height and px_width from my data because they have minimal impact on the dependent variable, which is the price range.

In [None]:
# Transform Your data
# Select your features wisely to avoid overfitting

# Defining X and y
mp_df.drop(['px_height', 'px_width'], axis = 1, inplace = True)

X = mp_df.drop(['price_range'], axis = 1)
y = mp_df['price_range']

### 5. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

##### Which method have you used to scale you data and why?

The provided code utilizes the MinMaxScaler from the Scikit-learn library to scale the data in variable X. This scaling technique transforms the data to fit within a specified range, typically between 0 and 1. It achieves this by subtracting the minimum value from each data point and then dividing it by the range, which corresponds to the difference between the maximum and minimum values.

MinMaxScaler is a commonly employed scaling method in machine learning, particularly when the data's distribution is unknown or non-normal, because it makes no assumption about the shape of the distribution. Its main drawback is sensitivity to outliers: the minimum and maximum define the range, so a single extreme value compresses all other values. That is acceptable here because the box plots above showed very few outliers.
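The scaling formula described above, (x - min) / (max - min), can be written out in a few lines. This is a sketch with made-up values; `min_max_scale` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array to [0, 1] with (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

values = np.array([500.0, 1000.0, 1500.0, 2000.0])
scaled = min_max_scale(values)
print(scaled)

# The result depends entirely on the minimum and maximum:
# adding one much larger value changes where every other point lands.
with_large_value = np.append(values, 20000.0)
scaled_with_large_value = min_max_scale(with_large_value)
print(scaled_with_large_value)
```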

### 6. Data Splitting

In [None]:
# Defining X and y

X = mp_df.drop(['price_range'], axis = 1)
y = mp_df['price_range']

In [None]:
# Finding the shape of X
X.shape

In [None]:
# Finding the shape of y
y.shape

In [None]:
# Split your data to train and test.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.20, random_state = 42)

In [None]:
# Finding X_train shape
X_train.shape

In [None]:
# Finding y_train shape
y_train.shape

##### What data splitting ratio have you used and why?

The code employs a data splitting ratio of 80:20 for training and test sets, respectively. This ratio is determined by setting the test_size parameter to 0.20. Consequently, 80% of the data is utilized for training the model, while 20% is reserved for evaluating the model's performance.

This is a standard practice in machine learning, as it allows for a substantial portion of the data to be used for training, facilitating effective model learning. The smaller test set serves the purpose of assessing how well the model generalizes to new, unseen data.

The random_state parameter is set to 42, an arbitrary value chosen to ensure reproducibility. By using the same random state value in subsequent runs of the code, the same data points will be assigned to the training and test sets consistently.
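The notebook scales the full feature matrix before splitting. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only, so that no information from the test set reaches preprocessing. The `_demo` names, the synthetic arrays, and the `stratify` argument are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the notebook's X and y.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = rng.integers(0, 4, size=200)

# 80:20 split; stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected.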

## ***7. ML Model Implementation***

### ML Model - 1

## **Logistic Regression**

In [None]:
# ML Model - 1 Implementation

# Applying logistic regression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)


# Making the Prediction

y_pred_test = lr.predict(X_test)
y_pred_train = lr.predict(X_train)


from sklearn.metrics import classification_report
print('Classification report for Logistic Regression (Test set)= ')
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred_test))


# Prediction on the model
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Generating the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

# Tick labels - List must be in sorted class order
ax.xaxis.set_ticklabels([0,1,2,3])
ax.yaxis.set_ticklabels([0,1,2,3])

# Displaying the visualization of the Confusion Matrix
plt.show()

In [None]:
# Evaluation metrics for train

from sklearn.metrics import classification_report
print('Classification report for Logistic Regression (Train set)= ')
# classification_report expects the true labels first, then the predictions
print(classification_report(y_train, y_pred_train))

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

The Logistic Regression model used provides a classification report that includes precision, recall, and F1-score for each class, along with the support (number of instances) for each class in the training set.

Precision represents the ratio of accurately predicted positive instances to the total number of positive predictions. Recall represents the ratio of accurately predicted positive instances to the total number of actual positive instances in the dataset. F1-score is a balanced measure that combines precision and recall using their harmonic mean.

The evaluation metrics indicate that the model achieved an overall accuracy of 83% on the training set, meaning it correctly classified 83% of the instances. For class 0, the precision is 93%, meaning that 93% of the instances predicted as class 0 actually belong to class 0. The recall for class 0 is 88%, meaning that the model correctly identified 88% of the actual class 0 instances. The F1-score for class 0 is 90%.

Similar precision, recall, and F1-score values are provided for classes 1, 2, and 3 in the report. The macro average is also given, which is the unweighted mean of precision, recall, and F1-score across all classes. In this case, the macro average for these scores is 83%.

The weighted average is also provided, which considers the number of instances in each class. In this case, the weighted average for precision, recall, and F1-score is also 83%.
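The per-class metrics and the two averages described above can be reproduced from a confusion matrix. The sketch below uses a made-up 3-class matrix with unequal class sizes so that the macro and weighted averages differ; none of the numbers come from the notebook's models.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are actual, columns are predicted.
cm = np.array([
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 19],
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # correct predictions / all predictions of the class
recall = tp / cm.sum(axis=1)     # correct predictions / all actual members of the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)         # number of actual instances per class

macro_f1 = f1.mean()                           # unweighted mean over classes
weighted_f1 = np.average(f1, weights=support)  # mean weighted by class size
accuracy = tp.sum() / cm.sum()

print(precision, recall, f1)
print(macro_f1, weighted_f1, accuracy)
```

When the classes are balanced, as in this dataset, the macro and weighted averages coincide.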

Overall, the model shows reasonably good performance with an accuracy of 83% on the training set. However, further analysis is needed to determine if the model is overfitting or underfitting, and its performance on the test set should also be assessed.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Implementation with hyperparameter optimization techniques
from sklearn.model_selection import cross_val_score

lr = LogisticRegression()
scores = cross_val_score(lr, X_scaled, y, cv=5)

print("Cross-validation scores:", scores)
print("Average cross-validation score:", np.mean(scores))

In [None]:
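from sklearn.model_selection import GridSearchCV

lr = LogisticRegression()
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(lr, param_grid, cv=5)
# Fitting on the training split only, so the test set stays unseen until scoring
grid.fit(X_train, y_train)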

print("Best cross-validation score:", grid.best_score_)
print("Best parameters:", grid.best_params_)
print("Test set score:", grid.score(X_test, y_test))

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is a popular method for optimizing hyperparameters in machine learning models. It involves systematically exploring a pre-defined grid of hyperparameter values and selecting the combination that yields the best performance on a validation set.

In this scenario, the grid consisted of various values for C, which determines the regularization strength of the logistic regression model. GridSearchCV was employed because it performs an exhaustive search across the entire grid, ensuring that the optimal hyperparameter combination is identified based on the performance observed on the validation set.

In summary, GridSearchCV is a straightforward yet effective approach for fine-tuning hyperparameters, contributing to enhanced performance in machine learning models.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The logistic regression model achieved a best cross-validation score of 0.82. The optimal value for the hyperparameter C was found to be 10. With this value, the model also achieved a test set score of 0.82. The cross-validation and test scores are close, which suggests that overfitting is unlikely.

In summary, the logistic regression model with the chosen hyperparameters appears to be a good fit for the dataset, as it attained an accuracy score of 0.82 on the test set. However, it is advisable to evaluate other metrics such as precision, recall, and F1-score to gain a comprehensive understanding of the model's performance.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Precision: Precision measures the accuracy of positive predictions made by the model, indicating how well it avoids false positive predictions. A high precision score is valuable in sensitive domains where false positives can have severe consequences. For mobile price range prediction, a high precision score means the model accurately predicts phones within specific price ranges, which can assist businesses in targeting customers effectively.

Recall: Recall measures the model's ability to identify all positive instances correctly. It is the share of actual positives that the model finds, so a low recall means many false negatives, which is crucial in areas where missing positives can be costly. In the context of mobile price range prediction, a high recall score implies the model correctly identifies most phones belonging to a specific price range. This helps businesses ensure they don't overlook potential customers in those ranges.

F1-score: F1-score combines precision and recall into a single metric, offering a balanced evaluation. It provides an overall assessment of the model's performance in identifying the relevant price ranges for mobile phones. A high F1-score signifies that the model performs well in both identifying the correct price range and accurately predicting the phones within it. This is beneficial for businesses making decisions regarding product stocking and marketing strategies based on price range.

While accuracy is important, considering precision, recall, and F1-score provides additional insights into the model's performance and its implications for a business.

### ML Model - 2

## **XGBoost**

In [None]:

# Applying XGBoost

from xgboost import XGBClassifier

xgb = XGBClassifier(max_depth = 5, learning_rate = 0.1)
xgb.fit(X_train, y_train)

# Making the Prediction

y_pred_train = xgb.predict(X_train)
y_pred_test = xgb.predict(X_test)

# Evaluation metrics for test

score = classification_report(y_test, y_pred_test)
print('Classification Report for XGBoost(Test set)= ')
print(score)

In [None]:
# Evaluation metrics for train

score = classification_report(y_train, y_pred_train)
print('Classification Report for XGBoost(Train set)= ')
print(score)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

The XGBoost model demonstrated exceptional performance on the training set, with an accuracy score of 0.99. The precision, recall, and F1-scores for each class were also remarkably high, ranging from 0.99 to 1.00. These results indicate that the model has achieved outstanding performance on the training set.

The macro average and weighted average F1-scores were also very high, suggesting that the model fits all classes about equally well on the training data and does not exhibit bias towards any specific class.

In summary, the XGBoost model showcases outstanding performance on the training set, with nearly perfect scores across all evaluation metrics. Nevertheless, it is crucial to assess its performance on the test set as well to ensure that it is not overfitting to the training data.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Defining the XGBoost classifier
xgb = XGBClassifier()

# Defining the hyperparameter search space
params = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 500, 1000],
}

# Performing cross-validation and hyperparameter tuning
grid_search = GridSearchCV(xgb, params, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Printing the best hyperparameters and CV score
print("Best hyperparameters:", grid_search.best_params_)
print("Cross-validation score:", grid_search.best_score_)

# Evaluating the tuned model on the test set
y_pred_test = grid_search.predict(X_test)
score = classification_report(y_test, y_pred_test)
print('Classification Report for XGBoost(Test set)= ')
print(score)

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Generating the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred_test)

print(cf_matrix)

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

# Tick labels - List must be in sorted class order
ax.xaxis.set_ticklabels([0,1,2,3])
ax.yaxis.set_ticklabels([0,1,2,3])

# Displaying the visualization of the Confusion Matrix.
plt.show()

In [None]:
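# Evaluation metrics for train

# Predicting with the tuned model so the report reflects the tuned XGBoost
y_pred_train = grid_search.predict(X_train)
score = classification_report(y_train, y_pred_train)
print('Classification Report for tuned XGBoost(Train set)= ')
print(score)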

##### Which hyperparameter optimization technique have you used and why?

The hyperparameter optimization technique employed here is GridSearchCV from scikit-learn's model_selection module. It exhaustively evaluates every combination in the specified grid (three values each for max_depth, learning_rate, and n_estimators, so 27 combinations) using 5-fold cross-validation on the training set, and selects the combination with the highest mean cross-validated accuracy. This is more expensive than a randomized search, but the grid is small enough for an exhaustive search to be practical. The selected model was then evaluated on the held-out test set.
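The code cell above passes its parameter grid to GridSearchCV with `cv=5`. As a quick check of how much work an exhaustive search over that grid involves, the combinations can be counted with scikit-learn's `ParameterGrid`. The variable names `n_combinations` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The same grid as in the tuning cell above.
params = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.1, 0.01, 0.001],
    "n_estimators": [100, 500, 1000],
}

n_combinations = len(ParameterGrid(params))
n_fits = n_combinations * 5  # one fit per combination per cross-validation fold
print(n_combinations, n_fits)
```

By default GridSearchCV also refits the best combination once on the whole training set, which adds one more fit.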

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After applying hyperparameter tuning and cross-validation, the performance of the XGBoost model changed only slightly. The reported cross-validation score was essentially unchanged (0.815 before tuning versus 0.81 after), while there were slight enhancements in precision, recall, and F1-score for each class in the test set classification report. Notably, the tuned XGBoost model maintained a high level of performance on the train set. The improvements are modest, so tuning should be read as a small refinement rather than a substantial gain in the model's ability to generalize to unseen data.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Precision: Precision is the measure of accuracy for positive predictions made by the model, representing how well it avoids false positive predictions. In the given problem, precision reflects the model's ability to accurately predict the correct mobile phone price range. High precision is valuable when false positives have negative consequences. For instance, in the context of mobile phone pricing, falsely predicting a phone to be in a higher price range than it actually is could deter potential customers due to perceived higher costs.

Recall: Recall quantifies the model's ability to correctly identify all positive instances, representing the ratio of true positive predictions to the total number of actual positive instances in the dataset. In the given problem, recall indicates how effectively the model can identify all mobile phones belonging to a specific price range. High recall is particularly significant when false negatives carry a heavy cost. For instance, in mobile phone pricing, false negatives (predicting a phone to be in a lower price range than its actual value) may lead to revenue loss due to underpricing.

F1-score: The F1-score is a balanced evaluation metric that combines precision and recall through their harmonic mean. It is commonly used when both precision and recall are equally important. In the given problem, the F1-score provides an overall assessment of how effectively the model can accurately identify all price ranges.

Support: Support represents the count of instances present in each class (price range) within the test set. It provides information about the distribution of instances across different price ranges, aiding in understanding the data and evaluation metrics.

Overall, these evaluation metrics play a crucial role in assessing the model's performance regarding accuracy, false positives, false negatives, and overall effectiveness. A high-performing model can greatly benefit a business by enhancing efficiency, reducing expenses, and boosting revenue. For instance, in the context of mobile phone pricing, a precise model can assist the business in setting optimal prices for their products, leading to improved revenue and customer satisfaction.
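The definitions above can be traced by hand from a confusion matrix. The following is a minimal sketch, assuming a made-up 4x4 matrix (the counts are invented for illustration and are not this notebook's results); `per_class_metrics` is a hypothetical helper name.

```python
# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
# The counts are invented for illustration only.
cm = [
    [45, 5, 0, 0],
    [4, 40, 6, 0],
    [0, 7, 38, 5],
    [0, 0, 6, 44],
]


def per_class_metrics(cm):
    """Return precision, recall, F1 and support for every class."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"class": k, "precision": precision, "recall": recall,
                     "f1": f1, "support": sum(cm[k])})
    return rows


metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for m in metrics:
    print(f"class {m['class']}: precision={m['precision']:.2f} "
          f"recall={m['recall']:.2f} f1={m['f1']:.2f} support={m['support']}")
print(f"accuracy={accuracy:.3f}")
```

Reading the matrix column-wise gives precision (how many predictions of a class were right), and reading it row-wise gives recall (how many actual members of a class were found).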

### ML Model - 3

## **Random Forest classifier**

In [None]:
# ML Model - 3 Implementation
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Random forest with 300 trees, trained on the train split
clsr = RandomForestClassifier(n_estimators=300)
clsr.fit(X_train, y_train)

In [None]:
y_pred = clsr.predict(X_test)
test_score= accuracy_score(y_test, y_pred)
test_score

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Generating the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)

# fmt='d' prints the counts as plain integers
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

# Tick labels - list must follow the sorted class order
ax.xaxis.set_ticklabels([0, 1, 2, 3])
ax.yaxis.set_ticklabels([0, 1, 2, 3])

# Displaying the visualization of the Confusion Matrix
plt.show()

In [None]:
feature_importance = pd.DataFrame({'Feature':X.columns,
                                   'Score':clsr.feature_importances_}).sort_values(by='Score', ascending=False).reset_index(drop=True)
feature_importance.head()

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax = sns.barplot(x=feature_importance['Score'], y=feature_importance['Feature'])
plt.show()

In [None]:
y_pred_train = clsr.predict(X_train)
train_score = accuracy_score(y_train, y_pred_train)
train_score

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The classification model utilized in this scenario is Random Forest. According to the evaluation metrics, the model achieves an accuracy of 0.80, indicating that 80% of its predictions are accurate. For class 0, the precision is 0.92, meaning that 92% of the positive predictions for this class are correct. Concerning class 1, the recall is 0.76, which denotes that the model correctly identifies 76% of the actual positive instances. As for class 2, the F1-score is 0.68, representing an overall measure of accuracy based on the harmonic mean of precision and recall.

To sum up, the Random Forest model exhibits moderate performance in this classification task, with accuracy, precision, recall, and F1-score varying between 0.63 and 0.92 based on the predicted class.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques
from sklearn.model_selection import GridSearchCV
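params = {'n_estimators': [10, 50, 100, 200],
          'max_depth': [10, 20, 30, 40],
          'min_samples_split': [2, 4, 6],
          # 'auto' is omitted: for classifiers it was an alias for 'sqrt',
          # and newer scikit-learn releases no longer accept it
          'max_features': ['sqrt', 4, 'log2'],
          'max_leaf_nodes': [10, 20, 40]
          }
rf = RandomForestClassifier()
clsr = GridSearchCV(rf, params, scoring='accuracy', cv=3)
# The search is fit on the full dataset, so best_score_ is a
# cross-validated score rather than a held-out test score
clsr.fit(X, y)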

In [None]:
clsr.best_params_

In [None]:
clsr.best_score_

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
clsr = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=30, max_features='log2',
                       max_leaf_nodes=40, max_samples=None,
                       min_impurity_decrease=0.0,
                       min_samples_leaf=1, min_samples_split=4,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
clsr.fit(X_train, y_train)

In [None]:
y_pred = clsr.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Generating the confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)

# fmt='d' prints the counts as plain integers
ax = sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

# Tick labels - list must follow the sorted class order
ax.xaxis.set_ticklabels([0, 1, 2, 3])
ax.yaxis.set_ticklabels([0, 1, 2, 3])

# Displaying the visualization of the Confusion Matrix.
plt.show()

In [None]:
y_pred = clsr.predict(X_train)
accuracy_score(y_train, y_pred)

In [None]:
print(classification_report(y_train, y_pred))

In [None]:
feature_importance = pd.DataFrame({'Feature':X.columns,
                                   'Score':clsr.feature_importances_}).sort_values(by='Score', ascending=False).reset_index(drop=True)
feature_importance.head()

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
ax = sns.barplot(x=feature_importance['Score'], y=feature_importance['Feature'])
plt.show()

##### Which hyperparameter optimization technique have you used and why?

I utilized GridSearchCV, a widely employed hyperparameter optimization technique. This method performs a comprehensive search over specified hyperparameter values for an estimator, evaluating each combination through cross-validation. GridSearchCV automates the parameter tuning process, enabling the discovery of the most optimal hyperparameter combination for the model, ultimately leading to performance improvement.
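The search procedure described above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: `k_fold_indices`, `grid_search`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in that ignores the data and simply rewards `max_depth=20` and `n_estimators=100`, where a real run would train a model on the training indices and score it on the validation indices.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds


def grid_search(param_grid, fit_and_score, n_samples, k=3):
    """Try every parameter combination and keep the best mean fold score."""
    names = list(param_grid)
    folds = k_fold_indices(n_samples, k)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, val) for train, val in folds]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    # Stand-in scorer (assumption): a real one would fit on train_idx
    # and return the accuracy measured on val_idx.
    return (1.0
            - abs(params["max_depth"] - 20) / 100
            - abs(params["n_estimators"] - 100) / 1000)


param_grid = {"n_estimators": [10, 50, 100, 200], "max_depth": [10, 20, 30, 40]}
best_params, best_score = grid_search(param_grid, toy_score, n_samples=9, k=3)
print(best_params, best_score)
```

The number of model fits is the product of the grid sizes times the number of folds, which is why adding one more hyperparameter to a grid can multiply the run time.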

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Indeed, there has been an enhancement in the overall performance of the model. The accuracy has risen from 0.80 to 0.81, and the weighted average F1-score has also improved from 0.80 to 0.81. Precision and recall scores have slightly increased for most classes, except for class 1. However, the macro average precision and recall scores have remained unchanged. Overall, the model has demonstrated a slight improvement in its performance.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The evaluation metrics considered are precision, recall, and F1-score, calculated individually for each class, along with their weighted and macro averages. Together they show how often each price range is predicted correctly, which is what determines the business value of the model.

Weighted average of precision, recall, and F1-score: In the context of mobile price range prediction, this metric considers class imbalance by incorporating weights based on the number of samples in each class. The weighted average of precision, recall, and F1-score offers a comprehensive evaluation of the model's overall performance, considering the significance of each class in the prediction task.

Macro average of precision, recall, and F1-score: In the context of mobile price range prediction, this is the unweighted mean of each metric across the four classes, so every price range counts equally regardless of how many samples it contains. A macro average that falls noticeably below the weighted average signals that the model struggles on a smaller class, and the per-class rows of the report then show which class that is.
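The difference between the two averages is easiest to see with numbers. A minimal sketch, assuming invented per-class F1-scores and supports (not this notebook's results):

```python
# Hypothetical per-class F1-scores and supports, invented for illustration.
f1_scores = [0.90, 0.70, 0.60, 0.80]
supports = [100, 50, 30, 20]

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

# With equal supports the two averages coincide.
balanced = [50, 50, 50, 50]
weighted_f1_balanced = sum(f * s for f, s in zip(f1_scores, balanced)) / sum(balanced)

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} "
      f"weighted (balanced)={weighted_f1_balanced:.3f}")
```

When the classes are roughly balanced, as the EDA reports for this dataset, the macro and weighted averages should be nearly identical.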

Confusion matrix: As previously stated, the confusion matrix shows which price ranges are mistaken for which, for example how often class 1 phones are predicted as class 2, allowing for a deeper understanding of the model's performance on each class.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I opted for the **logistic regression** and **XGBoost** models, as they outperformed the random forest classifier in prediction accuracy.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I can use a model explainability tool to describe and illustrate the logistic regression and XGBoost models, as well as highlight the significance of features in the prediction process.

Logistic regression is a linear classification algorithm. In its basic form it estimates the probability of a binary outcome by passing a linear combination of the input features through the logistic function. The price range here has four classes, so the model is extended, typically with a multinomial (softmax) or one-vs-rest formulation, to output one probability per class. The fitted coefficients show how each feature shifts the probability of a mobile phone falling into a specific price range.
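For a four-class target such as the price range, one common formulation is multinomial logistic regression, which turns one linear score per class into probabilities with the softmax function. The sketch below assumes two standardized features and made-up weights and intercepts; none of these numbers come from the fitted model in this notebook.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical coefficients: one row per price range (0..3),
# one column per feature (standardized ram, standardized battery_power).
weights = [[-2.0, -0.5], [-0.5, -0.1], [0.5, 0.1], [2.0, 0.5]]
intercepts = [0.0, 0.5, 0.5, 0.0]

x = [1.2, 0.3]  # a phone with above-average RAM and battery

# Linear score per class, then softmax to get class probabilities.
scores = [sum(w * xi for w, xi in zip(row, x)) + b
          for row, b in zip(weights, intercepts)]
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs], predicted_class)
```

The sign and size of each weight indicate how strongly a feature pushes a phone toward or away from the corresponding price range.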

By contrast, XGBoost is a powerful ensemble learning algorithm based on decision trees. It builds a series of decision trees iteratively, with each new tree correcting the errors made by the preceding ones. XGBoost handles both regression and classification tasks and is known for its accuracy and robustness.
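The idea of each new tree correcting the errors of the previous ones can be sketched with a much simpler relative of XGBoost: gradient boosting for squared error, using one-split trees (stumps) on a single feature. This is an illustration under those assumptions only; XGBoost itself adds regularization, second-order gradient information, and many engineering optimizations. The data and the names `fit_stump`, `boost`, and `mse` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors made so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 1, 3, 3]

mean_y = sum(ys) / len(ys)
mse_baseline = mse(lambda x: mean_y, xs, ys)
mse_boosted = mse(boost(xs, ys), xs, ys)
print(mse_baseline, mse_boosted)
```

Each round shrinks the remaining training error, and the learning rate controls how much of each stump's correction is kept.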

To elucidate the feature importance of the logistic regression and XGBoost models, we can utilize the SHAP (SHapley Additive exPlanations) model explainability tool. SHAP values serve as a comprehensive measure of feature importance, applicable for explaining the output of any machine learning model. Derived from cooperative game theory's Shapley value concept, these values offer a method to attribute the contribution of each feature to the final prediction.
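The Shapley value behind SHAP can be computed exactly for a tiny model by averaging each feature's marginal contribution over every order in which features are switched from a baseline to their actual values. The sketch below is a brute-force illustration, not the SHAP library's API; `toy_model`, its coefficients, and the input values are all hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all feature orderings."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]                # switch feature i to its actual value
            new = predict(current)
            contributions[i] += new - prev   # marginal contribution of feature i
            prev = new
    return [c / len(orderings) for c in contributions]


def toy_model(features):
    # Hypothetical price score with an interaction between ram and pixels.
    ram, battery, pixels = features
    return 2.0 * ram + 0.5 * battery + 0.1 * ram * pixels


x = [3.0, 4.0, 2.0]          # the phone being explained
baseline = [1.0, 1.0, 1.0]   # reference phone (assumption)

values = shapley_values(toy_model, x, baseline)
print(values, sum(values), toy_model(x) - toy_model(baseline))
```

The values always sum to the difference between the prediction and the baseline prediction, which is what makes them a consistent way to split a prediction among features. Enumerating all orderings is only feasible for a handful of features, so practical tools rely on approximations or model-specific shortcuts.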

# **Conclusion**

 According to the exploratory data analysis (EDA), the dataset includes mobile phones categorized into four distinct price ranges, each containing a comparable number of entries. Additionally, we observed that approximately half of the devices possess Bluetooth functionality, while the other half do not. Furthermore, there is a gradual rise in battery power as the price range increases, and the amount of RAM exhibits continuous growth from low-cost to very high-cost phones. Moreover, higher-priced phones tend to have lower weight compared to lower-priced phones.

 Based on our analysis, we found that RAM, battery power, and pixel quality are the most influential factors determining the price range of mobile phones. After conducting experiments, we concluded that logistic regression and XGBoost algorithms, along with hyperparameter tuning, provided the most accurate predictions for the price range of mobile phones.

To sum up, the exploratory data analysis unveiled that the dataset contains mobile phones categorized into four price ranges, each with a balanced representation of devices, and an equal distribution of Bluetooth functionality. Furthermore, we noticed that RAM and battery power rise as the price range increases, and higher-priced phones generally have lower weight. Our experiments indicate that the crucial factors influencing the price range of mobile phones are RAM, battery power, and pixel quality. Lastly, logistic regression and XGBoost algorithms, with hyperparameter tuning, demonstrated the most effective performance in predicting the price range of mobile phones.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***