# Mobile Price Classification


### Context 
Bob has started his own mobile company. He wants to compete with big companies like Apple and Samsung.

He does not know how to estimate the price of the mobiles his company creates. In this competitive mobile phone market, you cannot simply assume things. To solve this problem, he collects sales data on mobile phones from various companies.

Bob wants to find some relationship between the features of a mobile phone (e.g. RAM, internal memory) and its selling price. But he is not very good at machine learning, so he needs your help to solve this problem.

In this problem you do not have to predict the actual price, but a price range indicating how high the price is.

### Predicting Mobile Price Range
This notebook explores the data set with the goal of predicting the price range of a mobile phone based on its specifications.

This is a classification problem with the following labels:

- 0 = low cost 
- 1 = medium cost 
- 2 = high cost 
- 3 = very high cost

## Importing Dependencies 

In [None]:
# Data Manipulation and Data Visualization
%matplotlib inline 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

# Preprocessing 
from sklearn.preprocessing import MinMaxScaler

# Model Evaluator
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score


# Machine Learning models 
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Ignore warnings
import warnings 
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
                        

## Exploratory Data Analysis 

### Attributes Dictionary

- battery_power - total energy a battery can store at one time, measured in mAh
- blue - has Bluetooth or not
- clock_speed - speed at which the microprocessor executes instructions
- dual_sim - has dual SIM support or not
- fc - front camera megapixels
- four_g - has 4G or not
- int_memory - internal memory in gigabytes
- m_dep - mobile depth in cm
- mobile_wt - weight of the mobile phone
- n_cores - number of processor cores
- pc - primary camera megapixels
- px_height - pixel resolution height
- px_width - pixel resolution width
- ram - random access memory in megabytes
- sc_h - screen height of the mobile in cm
- sc_w - screen width of the mobile in cm
- talk_time - longest time that a single battery charge will last
- three_g - has 3G or not
- touch_screen - has touch screen or not
- wifi - has Wi-Fi or not
- price_range - the target variable, with values 0, 1, 2, 3

### Load the data 

In [None]:
mobile_train = pd.read_csv("/kaggle/input/mobile-price-classification/train.csv") 
mobile_test = pd.read_csv("/kaggle/input/mobile-price-classification/test.csv")

mobile_train.head()

In [None]:
mobile_test.head(10)

### Inspecting the data 

In [None]:
mobile_train.shape

In [None]:
# Checking Data types of the data
mobile_train.info()

Great, there are no null values in our data, so we can proceed without imputing anything.
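The `info()` output shows non-null counts per column, but a direct count of missing values is often easier to read. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `mobile_train`, with one missing value planted on purpose) to show the check.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for mobile_train (values are made up for illustration)
toy = pd.DataFrame({
    "ram": [512, 2048, np.nan, 3072],
    "battery_power": [800, 1500, 1200, 1900],
})

# Count missing values per column; all zeros would mean nothing to impute
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

Running the same two calls on `mobile_train` should report zero for every column if the data is as clean as `info()` suggests.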

In [None]:
# Checking Unique values
mobile_train.nunique()

In [None]:
mobile_train.price_range.value_counts()

In [None]:
mobile_train["price_range"].value_counts(normalize=True)

In [None]:
mobile_train.describe()

### Checking Relationship between the Variables

- For simplicity's sake, we only use bar graphs in this notebook, but you can use any plot that represents the data well.

In [None]:
corr = mobile_train.corr()
corr

In [None]:
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr, annot=True, ax=ax, cmap="icefire");

As we can see, ram is highly correlated with our target variable, price_range.

px_width, px_height, and battery_power are also positively correlated with the target, though noticeably less than ram.
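Reading a 20x20 heatmap by eye is error-prone, so it can help to pull out just the target column of the correlation matrix and sort it. The sketch below does this on synthetic data: `toy`, `noise_feature`, and the way `price_range` is derived from `ram` are all assumptions made for illustration, not properties of the real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic features: one that drives the target, one that is pure noise
ram = rng.integers(256, 4000, size=n)
noise_feature = rng.normal(size=n)

# Hypothetical target: ram cut into 4 equal-width bins labelled 0..3
price_range = pd.cut(ram, bins=4, labels=False)

toy = pd.DataFrame({
    "ram": ram,
    "noise_feature": noise_feature,
    "price_range": price_range,
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["price_range"]
    .drop("price_range")
    .sort_values(ascending=False)
)
print(target_corr)
```

Sorting by the signed value puts strong negative correlations at the bottom; sorting on `target_corr.abs()` instead ranks features by strength regardless of sign.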

In [None]:
mobile_train.hist(figsize=(20, 15));

#### Plotting the different cameras with ram and n_cores to see if there is a relationship

In [None]:
def plot_rel(x, y, index, col1, col2): 
    """
    plotting the two features relative to its price range. 
    x: first attribute 
    y: second attribute 
    index: the target variable 
    col1: name of x 
    col2: name of y
    """
    fig, ((ax, ax2), (ax3, ax4)) = plt.subplots(figsize=(12, 6), ncols = 2, nrows=2)

    ax.bar(x[index == 0], y[index == 0], color=["salmon"])

    ax2.bar(x[index == 1], y[index == 1], color=["lightblue"])

    ax3.bar(x[index == 2], y[index == 2], color=["lightgreen"])

    ax4.bar(x[index == 3], y[index == 3])

    ax.set(xlabel=col1, ylabel=col2)
    ax2.set(xlabel=col1, ylabel=col2)
    ax3.set(xlabel=col1, ylabel=col2)
    ax4.set(xlabel=col1, ylabel=col2)
    plt.show()
    
def specs_rel(x, y, x_name, y_name): 
    """
    plotting the two features. 
    x: first attribute 
    y: second attribute 
    x_name: name of attribute 1 (x)
    y_name: name of attribute 2 (y)
    """
    plt.barh(y, x, color=["lightblue"])
    plt.yticks(y)
    plt.xlabel(x_name)
    plt.ylabel(y_name);

In [None]:
plot_rel(mobile_train.fc, mobile_train.ram, mobile_train["price_range"], "Front Camera Pixels", "Ram")

In [None]:
plot_rel(mobile_train.pc, mobile_train.ram, mobile_train["price_range"], "Primary Camera(Mega pixel)", "Ram")

In [None]:
plot_rel(mobile_train.fc, mobile_train.n_cores, mobile_train["price_range"], "Front Camera Pixels", "N_cores")

In [None]:
plot_rel(mobile_train.pc, mobile_train.n_cores, mobile_train["price_range"], "Primary Camera(Mega pixel)", "N_Cores")

In [None]:
specs_rel(mobile_train["ram"], mobile_train["price_range"], "Ram", "Price Range")

In [None]:
specs_rel(mobile_train["n_cores"], mobile_train["price_range"], "n_cores", "Price Range")

In [None]:
specs_rel(mobile_train["clock_speed"], mobile_train["price_range"], "Clock Speed", "Price Range")

In [None]:
specs_rel(mobile_train["ram"], mobile_train["n_cores"], "Ram", "n_cores")

In [None]:
plt.bar(mobile_train["price_range"], mobile_train["sc_h"], color=["salmon"])
plt.xlabel("Price Range")
plt.ylabel("Screen Height");

In [None]:
plt.bar(mobile_train["price_range"], mobile_train["sc_w"], color=["salmon"])
plt.xlabel("Price Range")
plt.ylabel("Screen Width")
plt.xticks(mobile_train["price_range"]);

In [None]:
plt.barh(mobile_train["sc_h"], mobile_train["battery_power"])

plt.yticks(mobile_train["sc_h"])
plt.xlabel("Battery Power")
plt.ylabel("Screen Height");

In [None]:
plt.barh(mobile_train["sc_w"], mobile_train["battery_power"])

plt.yticks(mobile_train["sc_w"])
plt.xlabel("Battery Power")
plt.ylabel("Screen width");

### Cleaning the data

In [None]:
mobile_train = mobile_train.drop("m_dep", axis=1)

In [None]:
mobile_test = mobile_test.drop("m_dep", axis=1)

We drop the m_dep attribute because it has little to no effect on our target.

In [None]:
mobile_test = mobile_test.drop("id", axis=1)

Dropping the id column from the test set, since the training set has no such column and it would cause errors in our prediction later.

In [None]:
mobile_train[mobile_train["sc_w"] == 0]
mobile_train.loc[mobile_train["sc_w"] == 0, "sc_w"] = mobile_train["sc_w"].median()

In [None]:
mobile_test[mobile_test["sc_w"] == 0]
mobile_test.loc[mobile_test["sc_w"] == 0, "sc_w"] = mobile_test["sc_w"].median()

Replacing the 0 values of sc_w with the column median, because a real phone cannot have a screen width of 0.
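One detail worth noting: the cells above take the median over all rows, so the invalid zeros themselves pull the median down. A possible variant, sketched here on a made-up column (`toy` is hypothetical), computes the median from the non-zero rows only and then fills the zeros with it.

```python
import pandas as pd

# Made-up screen widths; the zeros play the role of invalid entries
toy = pd.DataFrame({"sc_w": [0, 0, 0, 4, 6, 8]})

# Median over all rows, zeros included
median_all = toy["sc_w"].median()

# Median over the valid (non-zero) rows only
nonzero_median = toy.loc[toy["sc_w"] > 0, "sc_w"].median()

# Replace the zeros with the non-zero median
toy["sc_w"] = toy["sc_w"].mask(toy["sc_w"] == 0, nonzero_median)

print(median_all, nonzero_median)
print(toy["sc_w"].tolist())
```

Whether this changes the model's score on the real data is not something this sketch shows; it only illustrates the difference between the two medians.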

In [None]:
mobile_train.tail()


In [None]:
mobile_test.head()

In [None]:
mobile_train["sc_a"] = mobile_train["sc_h"] * mobile_train["sc_w"] 
mobile_train.drop(["sc_h", "sc_w"], axis=1, inplace=True)

Adding a new attribute called "sc_a", the screen area of the mobile (sc_h * sc_w), and dropping the sc_h and sc_w attributes.
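Because the same transformation has to be applied to both the training and the test set, it can be wrapped in a small helper so the two frames cannot drift apart. The function name `add_screen_area` and the toy frames below are assumptions for illustration.

```python
import pandas as pd

def add_screen_area(df):
    """Return a copy with sc_a = sc_h * sc_w and the two source columns dropped."""
    out = df.copy()
    out["sc_a"] = out["sc_h"] * out["sc_w"]
    return out.drop(columns=["sc_h", "sc_w"])

# Made-up rows standing in for the train and test frames
toy_train = pd.DataFrame({"sc_h": [10, 15], "sc_w": [5, 7], "ram": [1024, 2048]})
toy_test = pd.DataFrame({"sc_h": [12], "sc_w": [6], "ram": [512]})

toy_train = add_screen_area(toy_train)
toy_test = add_screen_area(toy_test)

print(toy_train)
print(toy_test)
```

The helper returns a new frame instead of modifying its argument, which makes re-running a notebook cell safe.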

In [None]:
mobile_train.dtypes

In [None]:
mobile_test["sc_a"] = mobile_test["sc_h"] * mobile_test["sc_w"] 
mobile_test.drop(["sc_h", "sc_w"], axis=1, inplace=True)

In [None]:
mobile_test.dtypes

In [None]:
corr = mobile_train.corr()
corr["price_range"].sort_values(ascending=False)

## Modeling   

Now that we have visualized our data and gained some insight into it, we are ready to build models.

The models we are going to use are RandomForest, SVC, Logistic Regression, XGBoost, and CatBoost.

We will only use XGBoost and CatBoost for the initial training, to see how they perform on our data.
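As a rough illustration of comparing several candidate models on equal footing, the sketch below loops over a dictionary of estimators and records the mean cross-validated accuracy of each. It uses synthetic four-class data from `make_classification` and only two scikit-learn models; `X_demo`, `y_demo`, and `candidates` are hypothetical names, and the scores say nothing about the mobile data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 4-class problem standing in for the mobile data
X_demo, y_demo = make_classification(
    n_samples=400,
    n_features=10,
    n_informative=5,
    n_classes=4,
    random_state=42,
)

# Candidate models keyed by a readable name
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold accuracy for each candidate
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Adding another model is then a one-line change to the dictionary.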

In [None]:
mobile_train.head()

In [None]:
X = mobile_train.drop("price_range", axis=1)
y = mobile_train["price_range"]

In [None]:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size = 0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape

### Functions for modeling 

#### Initial Modeling

In [None]:
def unscaled_modeling(model, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates the given machine learning model.
    model  : Scikit-learn machine learning model.
    X_train: training data (no labels)
    X_test : testing data (no labels)
    y_train: training labels
    y_test : test labels
    """
    # Setting random seed
    np.random.seed(42)
    # Fit on the training data
    model.fit(X_train, y_train)
    # Accuracy on the test data, as a percentage
    clf_score = model.score(X_test, y_test)

    return clf_score * 100

def scaled_modeling(model, scaler, X_train, X_test, y_train, y_test):
    """
    Scales the features, then fits and evaluates the given machine learning model.
    model  : Scikit-learn machine learning model.
    scaler : Scikit-learn scaler (e.g. MinMaxScaler).
    X_train: training data (no labels)
    X_test : testing data (no labels)
    y_train: training labels
    y_test : test labels
    """
    # Setting random seed
    np.random.seed(42)
    # Fit the scaler on the training data only, then apply it to both sets
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # Fit on the scaled training data and score on the scaled test data
    model.fit(X_train_scaled, y_train)
    clf_score = model.score(X_test_scaled, y_test)
    return clf_score * 100
    

In [None]:
# Models 
rf = RandomForestClassifier()
logistic_reg = LogisticRegression()
svc = SVC()
xgb = XGBClassifier()
catboost = CatBoostClassifier()

### Random Forest initial score

In [None]:
unscaled_modeling(rf, X_train, X_test, y_train, y_test)

### Logistic Regression initial score

In [None]:
unscaled_modeling(logistic_reg, X_train, X_test, y_train, y_test)

### SVC Initial score

In [None]:
scaled_modeling(svc, MinMaxScaler(), X_train, X_test, y_train, y_test)

### XGBoost Initial Score

In [None]:
unscaled_modeling(xgb, X_train, X_test, y_train, y_test)

### CatBoost Initial Score 

In [None]:
unscaled_modeling(catboost, X_train, X_test, y_train, y_test)


## Tuning Our Models 

### Hyperparameter tuning with RandomizedSearchCV

In [None]:
# Scaled data 
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
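# "auto" was removed from max_features in recent scikit-learn releases;
# for classifiers it meant the same as "sqrt", so it is left out here
rf_grid = {"n_estimators": np.arange(20, 500, 10),
           "max_depth":np.arange(3, 20),
           "max_features":["sqrt", "log2"]}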

svc_grid = {"kernel":["rbf", "linear", "poly", "sigmoid"], 
            "gamma":[0.001, 0.01, 0.1, 1, 10], 
            "C":[0.001, 0.01, 0.1, 1, 10],}
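# The l1 penalty is only supported by liblinear among these solvers,
# so the grid is split into two valid sub-grids
logis_grid = [{"C":[0.01, 0.1, 1, 10],
               "penalty":["l2"],
               "solver":["lbfgs", "liblinear", "newton-cg"]},
              {"C":[0.01, 0.1, 1, 10],
               "penalty":["l1"],
               "solver":["liblinear"]}]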

def tuning(model, grid, cv, X_train, y_train, X_test, y_test):
    """
    Hyperparameter tuning using RandomizedSearchCV 
    model: the estimator you will be using 
    grid: the parameters you will tune 
    cv: folds for cross-validation
    X_train: training set with our features
    y_train: training set with our labels 
    X_test: test set with our features 
    y_test: test set with our labels 
    """
    np.random.seed(42)
    clf = RandomizedSearchCV(model,
                             param_distributions=grid,
                             cv=cv,
                             n_iter = 20,
                             verbose=True)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test) * 100, clf.best_params_

In [None]:
# RandomForest 
tuning(rf, rf_grid, 5, X_train, y_train, X_test, y_test)

In [None]:
# SVC
tuning(svc, svc_grid, 5, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
# Logistic Regression
tuning(logistic_reg, logis_grid, 5, X_train, y_train, X_test, y_test)

As we can see, logistic regression, which had a lower score in our initial training, now has the highest accuracy.
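In the SVC search above, the scaler was fit on the whole training split before cross-validation, so each validation fold has already influenced the scaling. One common alternative, sketched here on synthetic data, is to put the scaler and the model in a `Pipeline` and tune the pipeline, so the scaler is refit inside every fold. The data, the reduced grid, and the names `pipe` and `search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the mobile features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaler and model travel together, so scaling is learned per CV fold
pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step>__<param>
param_distributions = {
    "svc__kernel": ["rbf", "linear"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

The fitted `search` object can be used directly for prediction on unscaled inputs, since the best pipeline applies its own scaler.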

## Evaluating the models

As additional evaluation metrics, we will use the confusion matrix, the classification report, and cross-validation.
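When reading a confusion matrix it helps to remember scikit-learn's orientation: rows are the true labels and columns are the predicted labels. The small hand-made example below (the label lists are invented) shows this, along with per-class recall computed from the matrix.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Recall per class = correct predictions / number of true samples in that class
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

Here the first row reads: of the two true class-0 samples, one was predicted as 0 and one as 1.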

In [None]:
def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True, # Annotate the boxes
                     cbar=False,
                     cmap="icefire",
                     fmt= "d")
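    # confusion_matrix puts true labels on the rows (y-axis)
    # and predicted labels on the columns (x-axis)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")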
    
def class_report(y_test, y_preds): 
    """
    Printing the classification report 
    y_test: labels 
    y_preds: prediction of the model 
    """
    print(classification_report(y_test, y_preds))
    
def cross_val(model, X, y): 
    """
    Cross validate our model
    """
    np.random.seed(42)
    cv_score = np.mean(cross_val_score(model,
                               X,
                               y,
                               cv = 5,
                               error_score="raise"))
    return cv_score

In [None]:
# Models
svc_tuned = SVC(kernel="linear", C=1, gamma=1)
log_reg_tuned = LogisticRegression(solver="newton-cg", penalty="l2", C=0.1)

In [None]:
# Fitting the data in fine tuned model
svc_tuned.fit(X_train_scaled, y_train)
log_reg_tuned.fit(X_train, y_train)

In [None]:
# Predicting on the test split
svc_y_preds = svc_tuned.predict(X_test_scaled)
log_y_preds = log_reg_tuned.predict(X_test)

In [None]:
confusion_matrix(y_test, svc_y_preds)

In [None]:
plot_conf_mat(y_test, svc_y_preds)

In [None]:
confusion_matrix(y_test, log_y_preds)

In [None]:
plot_conf_mat(y_test, log_y_preds)

In [None]:
class_report(y_test, svc_y_preds)

In [None]:
class_report(y_test, log_y_preds)

In [None]:
precision_score(y_test, log_y_preds, average="weighted")

In [None]:
cross_val(log_reg_tuned, X, y)

In [None]:
X_scaled = scaler.fit_transform(X)

In [None]:
cross_val(svc_tuned,X_scaled, y)

## Final Prediction 

For our final prediction we will use the logistic regression model, but you could also use the SVC model.
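Before predicting on a separate test file, it is worth making sure its columns match the training features in both name and order. The sketch below shows one way to do that by selecting the training columns from the new frame; the frames `train_X` and `new_phones` and their values are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training features and labels
train_X = pd.DataFrame({
    "ram": [500, 1500, 2500, 3500],
    "battery_power": [800, 1000, 1500, 1900],
})
train_y = [0, 1, 2, 3]

# New data with an extra id column and a different column order
new_phones = pd.DataFrame({
    "id": [1, 2],
    "battery_power": [900, 1800],
    "ram": [600, 3400],
})

model = LogisticRegression(max_iter=1000).fit(train_X, train_y)

# Keep only the training columns, in the training order
aligned = new_phones[train_X.columns]
preds = model.predict(aligned)
print(preds)
```

Indexing with `train_X.columns` raises a `KeyError` if a feature is missing from the new data, which surfaces the mismatch early.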

In [None]:
final_preds = log_reg_tuned.predict(mobile_test)
final_preds

In [None]:
mobile_test["price_range"] = final_preds
mobile_test.head()

In [None]:
mobile_test.tail()