# Mobile Price Prediction

On the basis of mobile specification data we are going to predict the price range of the device.

**USE**:

- This kind of prediction will help companies estimate the price of mobiles so that they can compete strongly with other mobile manufacturers.
- It will also be useful for consumers to verify that they are paying the best price for a mobile.

#### Dataset Description
- id: ID
- battery_power: Total energy a battery can store in one time measured in mAh
- blue: Has bluetooth or not
- clock_speed: speed at which microprocessor executes instructions
- dual_sim: Has dual sim support or not
- fc: Front Camera mega pixels
- four_g: Has 4G or not
- int_memory: Internal Memory in Gigabytes
- m_dep: Mobile Depth in cm
- mobile_wt: Weight of mobile phone
- n_cores: Number of cores of processor
- pc: Primary Camera mega pixels
- px_height: Pixel Resolution Height
- px_width: Pixel Resolution Width
- ram: Random Access Memory in Megabytes
- sc_h: Screen Height of mobile in cm
- sc_w: Screen Width of mobile in cm
- talk_time: longest time that a single battery charge will last when you are
- three_g: Has 3G or not
- touch_screen: Has touch screen or not
- wifi: Has wifi or not
- price_range : phone price range


Here, price_range is our target.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import array

In [None]:
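# The 'seaborn' matplotlib style name was removed in newer matplotlib releases;
# seaborn's own default theme gives a similar look.
sns.set_theme()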

In [None]:
train=pd.read_csv("https://raw.githubusercontent.com/overtunned/DataScience/main/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/overtunned/DataScience/main/test.csv")

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
pd.unique(train["price_range"])

## Pre-processing

In [None]:
train.isnull().sum() #checking for null values

In [None]:
sns.countplot(x=train["price_range"]) #checking for imbalance

## EDA

In [None]:
train.corr()

In [None]:
plt.figure(figsize=(20,15))
sns.heatmap(train.corr(),annot = True)

We observe that
* ram has a high positive correlation with price_range.
* battery_power, px_height and px_width also show a positive correlation with price_range.
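
The correlation ranking read off the heatmap can also be produced as a sorted table. The following is a minimal sketch on synthetic stand-in data, not the real dataset; the helper name `rank_by_target_corr` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def rank_by_target_corr(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute strength but keep the sign of each correlation
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic stand-in: price_range is derived from ram, clock_speed is pure noise
rng = np.random.default_rng(0)
n = 400
ram = rng.integers(256, 4000, size=n)
toy = pd.DataFrame({
    "ram": ram,
    "clock_speed": rng.uniform(0.5, 3.0, size=n),
    "price_range": pd.cut(ram, bins=4, labels=False),
})

ranking = rank_by_target_corr(toy, "price_range")
print(ranking)
```

In the notebook the same helper could be called as `rank_by_target_corr(train, "price_range")`.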

Next, we look more closely at the correlated attributes.

In [None]:
fig = plt.figure(figsize=(15,30))
for i, col in enumerate(train.columns):
    ax=plt.subplot(7,3,i+1)
    train[col].hist(ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
sns.boxenplot(x="price_range",y="battery_power", data=train,ax = ax)

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(x=train['price_range'],y=train['px_height'],ax=ax);

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(x=train['price_range'],y=train['px_width'],ax=ax);

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(x=train['price_range'],y=train['ram'],ax=ax, showfliers= False)

Study into other attributes

In [None]:
sns.pointplot(x=train['price_range'],y=train['fc'])

- We can observe that the front camera megapixels increase with price, but the most expensive phones seem to have slightly fewer front camera megapixels.

In [None]:
sns.pointplot(x=train['price_range'],y=train['pc'])

- We can observe higher back camera megapixels as the price of the device increases.
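
A point plot shows the mean of the y-variable for each category, so the same trend can be checked numerically with a group-wise mean. This is a small sketch on made-up numbers (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the computation
toy = pd.DataFrame({
    "price_range": [0, 0, 1, 1, 2, 2, 3, 3],
    "fc": [2, 4, 3, 5, 4, 8, 5, 5],
    "pc": [6, 8, 9, 11, 10, 14, 12, 16],
})

# Mean front (fc) and primary (pc) camera megapixels per price range
means = toy.groupby("price_range")[["fc", "pc"]].mean()
print(means)
```

On the real data, `train.groupby("price_range")[["fc", "pc"]].mean()` would give the values that the point plots draw.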

In [None]:
sns.catplot(x="price_range",y="battery_power", data=train, kind="boxen")

- We can observe an increase in the battery capacity of the phones as the price range increases.

In [None]:
sns.catplot(x='price_range',col='three_g',hue ='four_g',data = train, kind ='count')

We can observe that
* every phone that has 4G also has 3G (phones without 3G never have 4G)
* nearly half the phones have both 3G and 4G
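
The relationship between the two flags can be checked with a contingency table rather than read off the plot. The sketch below uses a tiny hypothetical frame with the same column names; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical flags, not the real dataset
toy = pd.DataFrame({
    "three_g": [1, 1, 1, 0, 1, 0, 1, 1],
    "four_g":  [1, 0, 1, 0, 1, 0, 0, 1],
})

# Rows: three_g, columns: four_g
table = pd.crosstab(toy["three_g"], toy["four_g"])
print(table)

# Phones that claim 4G without 3G, and the share that has both
four_g_without_three_g = int(((toy["four_g"] == 1) & (toy["three_g"] == 0)).sum())
share_both = float(((toy["four_g"] == 1) & (toy["three_g"] == 1)).mean())
print(four_g_without_three_g, share_both)
```

Running `pd.crosstab(train["three_g"], train["four_g"])` on the real data shows which combinations actually occur.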

In [None]:
sns.pointplot(y="int_memory", x="price_range", data=train)

In [None]:
# temp=pd.DataFrame(train[train.columns[:6]])
# temp['price_range']=train['price_range']
# sns.pairplot(temp,hue='price_range')

In [None]:
# train1 = train[~((train['ram'] < 1400) & (train['price_range'] == 2))]
# train1 = train1[~((train1['ram'] > 1650) & (train1['price_range'] == 0))]

## Hypothesis Testing

In [None]:
ram_price=train[[ 'price_range', 'ram',]]
ram_price=ram_price[(ram_price['price_range'] == 1) | (ram_price['price_range']== 2) ]
ram_price

In [None]:
ram_price.hist(by='price_range')

In [None]:
price1=ram_price['ram'] [ram_price['price_range'] == 1]
price1

In [None]:
price2=ram_price['ram'] [ram_price['price_range'] == 2]
price2

In [None]:
price2.hist(histtype='stepfilled', alpha=.5, bins=20)
price1.hist(histtype='stepfilled', alpha=.5, color=sns.desaturate("red", 1))
plt.xlabel('Ram',fontsize=15)
plt.ylabel('Number of phones',fontsize=15)
plt.show()

- From the distributions we can observe that phones in price range 1 sit slightly to the left.
- The ram of phones in price range 1 therefore seems lower than that of phones in price range 2.

- Null hypothesis ($H_0$): There is no difference in mean ram between price ranges 1 and 2.
- Alternative hypothesis ($H_1$): There is a significant difference in mean ram between the two price ranges.

In [None]:
means_table = ram_price.groupby('price_range').mean()
means_table

In [None]:
def meandiff(df, attr):
    means_table = df.groupby('price_range').mean()
    return (means_table[attr].iloc[0]- means_table[attr].iloc[1])

In [None]:
ob_diff = meandiff(ram_price, 'ram')
ob_diff

We are going to simulate the null hypothesis by randomly shuffling the ram values across the two price ranges.

There are 1000 rows in the population, so we draw 1000 rows without replacement, which amounts to a random permutation of the ram column.

In [None]:
shuffled = ram_price.sample(1000,replace = False)
shuffled_ram = shuffled['ram']
original_and_shuffled= ram_price.assign(shuffled_ram=shuffled_ram.values )
difference = meandiff(original_and_shuffled, 'shuffled_ram')
difference

In [None]:
differences = np.zeros(5000)
for i in np.arange(len(differences)):
    shuffled = ram_price.sample(1000,replace = False)
    shuffled_ram = shuffled['ram']
    original_and_shuffled= ram_price.assign(shuffled_ram=shuffled_ram.values )
    difference = meandiff(original_and_shuffled, 'shuffled_ram')
    differences[i] = difference
differences_df = pd.DataFrame(differences)
differences_df

In [None]:
differences_df.hist()
plt.title('Prediction Under Null Hypotheses');
plt.xlabel('Differences between Group Averages',fontsize=15)
plt.ylabel('Units',fontsize=15);
print('Observed Difference:', ob_diff)
plt.axvline(ob_diff, color='red');

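We can observe that the simulated distribution is centered around 0, while the observed difference (red line) falls in its tail. A difference of this size is unlikely under the null hypothesis, which points towards a real difference between the ram of phones in price ranges 1 and 2. The p-value below makes this precise.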

In [None]:
p_value=np.count_nonzero(differences <= ob_diff)/differences.size

In [None]:
if p_value < 0.05: 
    print("we are rejecting the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

The p-value is less than 0.05 hence we reject the null hypothesis.
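
The shuffle-and-compare procedure can be condensed into one function. The following is a sketch under assumptions: it works on two plain arrays, uses synthetic ram-like values instead of the real data, and adds one to numerator and denominator so the estimated p-value is never exactly zero (a small departure from the plain fraction used in the notebook). The name `permutation_p_value` is hypothetical.

```python
import numpy as np

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """One-sided permutation test on mean(a) - mean(b).

    Returns the observed difference and the share of shuffled differences
    that are less than or equal to it.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling the pooled values breaks any link between value and group
        perm = rng.permutation(pooled)
        diffs[i] = perm[:a.size].mean() - perm[a.size:].mean()
    # Add-one correction (an assumption of this sketch)
    p = (np.count_nonzero(diffs <= observed) + 1) / (n_perm + 1)
    return observed, p

# Synthetic groups standing in for the ram of price ranges 1 and 2
rng = np.random.default_rng(1)
group1 = rng.normal(1700, 300, size=500)
group2 = rng.normal(2600, 300, size=500)
obs, p = permutation_p_value(group1, group2, n_perm=2000)
print(obs, p)
```

With the notebook's variables this would be called as `permutation_p_value(price1, price2)`.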

### Student's t-Test
- We now run a two-sample t-test on the ram of all phones in price ranges 1 and 2.

In [None]:
from scipy import stats

In [None]:
s_A = train[train['price_range']==1]
s_A=s_A['ram']
s_A.mean()

In [None]:
s_B = train[train['price_range']==2]
s_B=s_B['ram']
s_B.mean()

In [None]:
ttest,p_value=stats.ttest_ind(s_A,s_B)

In [None]:
if p_value < 0.05: 
    print("we are rejecting the null hypothesis")
else:
    print("we fail to reject the null hypothesis")
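
By default `stats.ttest_ind` assumes both groups have the same variance. If that assumption is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below uses synthetic samples, not the real ram columns, and is meant only to show the call:

```python
import numpy as np
from scipy import stats

# Synthetic samples with different spreads, standing in for s_A and s_B
rng = np.random.default_rng(0)
sample_a = rng.normal(1700, 300, size=500)
sample_b = rng.normal(2600, 350, size=500)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, welch_p = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(t_stat, welch_p)
```

A negative statistic here means the first sample has the lower mean.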

## Regression

In [None]:
# separating the attributes into categorical and numerical for scaling and making a new dataframe for prediction.
num=['battery_power',
     'clock_speed',
     'fc',
     'int_memory',
     'mobile_wt', 
     'n_cores',
     'm_dep',
     'pc', 
     'px_height',
     'px_width', 
     'ram', 
     'sc_h', 
     'sc_w', 
     'talk_time']


cat=['blue',
     'dual_sim',
     'four_g',
     'three_g', 
     'touch_screen',
     'wifi']

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
import statsmodels.api as sm

In [None]:
scaler_train = MinMaxScaler()
train1=train.copy(deep=True)
train1[num]=scaler_train.fit_transform(train1[num])
# X=train1.drop('price_range',axis=1)
X=train1[['battery_power','px_height','px_width','ram']]
y=train1['price_range']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
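
The scaler above is fitted on the full dataset before the split, so the test rows influence the scaling. One possible alternative, sketched here on synthetic data, is to put the scaler and the model in a pipeline so that scaling parameters are learned from the training split only. The toy arrays and the choice of `make_pipeline` are assumptions of this sketch, not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features; the target is the last feature cut into four levels (0..3)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 4000, size=(300, 4))
y_toy = np.digitize(X_toy[:, 3], bins=[1000, 2000, 3000])

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=101)

# fit() learns min/max from X_tr only; score() reuses them on X_te
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
r2 = pipe.score(X_te, y_te)
print(r2)
```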

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, RepeatedKFold, RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report,auc

In [None]:
linreg=LinearRegression()
linreg.fit(X_train,y_train)
linreg.score(X_train,y_train)

In [None]:
linreg.score(X_test,y_test)

In [None]:
print("The linear model intercept is {:.3}".format(linreg.intercept_))
# The model has one coefficient per feature, in the column order of X_train
for name, coef in zip(X_train.columns, linreg.coef_):
    print("The linear model coefficient for {} is {:.3}".format(name, coef))

In [None]:
linreg_coef = pd.Series(index = X_train.columns, data = np.abs(linreg.coef_))
n_features = (linreg_coef>0).sum()
print(f'{n_features} features with reduction of {(1-n_features/len(linreg_coef))*100:2.2f}%')
linreg_coef.sort_values().plot(kind = 'bar', figsize = (13,5));

### Support Vector Regression

In [None]:
from sklearn.svm import SVR

In [None]:
svr = SVR(kernel='linear')
svr.fit(X_train,y_train)
svr.score(X_train,y_train)

In [None]:
svr_coef = pd.Series(index = X_train.columns, data = np.abs(svr.coef_[0]))
n_features = (svr_coef>0).sum()
print(f'{n_features} features with reduction of {(1-n_features/len(svr_coef))*100:2.2f}%')
svr_coef.sort_values().plot(kind = 'bar', figsize = (13,5));

In [None]:
X = X_test
X = sm.add_constant(X)
est=sm.OLS(y_test, X)
est = est.fit()
est.summary()

## Classification

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

def roc_plot(clf, X, y, fn):
    y1 = label_binarize(y, classes=[0, 1, 2, 3])
    n_classes = y1.shape[1]

    X_train, X_test, y_train, y_test = train_test_split(X, y1, test_size=.5,random_state=0)

    classifier = OneVsRestClassifier(clf).fit(X_train, y_train)
    
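    if fn==1:
        # continuous decision scores (e.g. logistic regression, SVC)
        y_pred= classifier.decision_function(X_test)
    elif fn==2:
        # models without decision_function (e.g. KNN, naive Bayes): use class
        # probabilities, since hard 0/1 predictions give only one threshold
        y_pred= classifier.predict_proba(X_test)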

    fpr = dict()
    tpr = dict()
    thrhld=dict()
    roc_auc = dict()

    for i in range(n_classes):
        fpr[i], tpr[i], thrhld[i] = roc_curve(y_test[:, i], y_pred[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    lw = 2
    fig = plt.figure(figsize=(15,15))
    for i in range(n_classes):
        ax=plt.subplot(2,2,i+1)
        ax.plot(fpr[i], tpr[i], color='darkorange',
                 lw=lw, label='ROC curve (area = %0.6f)' % roc_auc[i])
        ax.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title(f'ROC {i}')
        ax.legend(loc="lower right")
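
A ROC curve is traced by sweeping a threshold over continuous scores, so the kind of input passed to `roc_curve` matters. The sketch below, on invented labels and scores, compares the number of curve points obtained from scores with the number obtained from hard 0/1 predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Invented binary labels and classifier scores for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9])
hard = (scores >= 0.5).astype(int)  # hard predictions at a single threshold

fpr_s, tpr_s, _ = roc_curve(y_true, scores)
fpr_h, tpr_h, _ = roc_curve(y_true, hard)

# Scores give several operating points; hard labels give only the two
# end points plus one interior point
print(len(fpr_s), len(fpr_h))
```

This is why scores from `decision_function` or `predict_proba` are usually preferred over `predict` output when plotting ROC curves.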

In [None]:
def kfoldcv(model,X,y):
    cv = KFold(n_splits=10,shuffle=True, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print("{:.3} accuracy with a standard deviation of {:.3}" .format(n_scores.mean(), n_scores.std()))

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler=StandardScaler()
train2=train.copy(deep=True)
train2[num]=std_scaler.fit_transform(train2[num])
# X=train2.drop('price_range',axis=1)
X=train2[['battery_power','px_height','px_width','ram']]
y=train2['price_range']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lm=LogisticRegression()
lm.fit(X_train,y_train)
lm.score(X_test,y_test)

In [None]:
y_pred=lm.predict(X_test)
confusion_matrix(y_test,y_pred)
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True,fmt='.3g')

In [None]:
kfoldcv(lm,X_train,y_train)

In [None]:
print(classification_report(y_test, y_pred))  # scikit-learn expects (y_true, y_pred)
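
scikit-learn metrics such as `classification_report`, `precision_score` and `recall_score` take the true labels first and the predictions second. Swapping the two exchanges precision and recall, as this small sketch on invented labels shows:

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels: one negative is wrongly predicted as positive
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 1]

# Correct order: (y_true, y_pred)
precision = precision_score(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy)

# Swapped order: the two metrics trade places
precision_swapped = precision_score(y_pred_toy, y_true_toy)
recall_swapped = recall_score(y_pred_toy, y_true_toy)

print(precision, recall)
print(precision_swapped, recall_swapped)
```

Accuracy is unaffected by the order, so a swapped call can go unnoticed if only the overall accuracy is checked.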

In [None]:
roc_plot(lm, X, y, 1)

In [None]:
steps = [('pca', PCA(n_components=3)), ('lm', LogisticRegression())]
model = Pipeline(steps=steps)
model.fit(X_train, y_train)
model.score(X_test,y_test)

In [None]:
y_pred=model.predict(X_test)
confusion_matrix(y_test,y_pred)
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True,fmt='.3g')

In [None]:
kfoldcv(model,X_train,y_train)

In [None]:
print(classification_report(y_test, y_pred))  # scikit-learn expects (y_true, y_pred)

In [None]:
roc_plot(model, X, y, 1)

In [None]:
y_prob = model.predict_proba(X_test)
roc_auc_ovo = roc_auc_score(y_test, y_prob, multi_class="ovo")

roc_auc_ovr = roc_auc_score(y_test, y_prob, multi_class="ovr")

print("One-vs-One ROC AUC scores:\n{:.6f}"
      .format(roc_auc_ovo))
print()
print("One-vs-Rest ROC AUC scores:\n{:.6f}"
      .format(roc_auc_ovr))

### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
params_knn = {'n_neighbors' : [3, 5, 7, 9, 11, 13, 15]}

knn = KNeighborsClassifier()
knn_classifier = GridSearchCV(knn, params_knn, cv=10, n_jobs=-1)
knn_classifier.fit(X_train, y_train)

print(f'Optimal neighbors: {knn_classifier.best_params_["n_neighbors"]}')
print(f'Best score: {knn_classifier.best_score_}')

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=knn_classifier.best_params_["n_neighbors"])
knn_model.fit(X_train, y_train)
y_pred=knn_model.predict(X_test)
confusion_matrix(y_test,y_pred)
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True,fmt='.3g')

In [None]:
steps = [('pca', PCA(n_components=3)), 
         ('knn', KNeighborsClassifier(n_neighbors=knn_classifier.best_params_["n_neighbors"]))]
model = Pipeline(steps=steps)
model.fit(X_train, y_train)
y_pred=model.predict(X_test)
confusion_matrix(y_test,y_pred)
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True,fmt='.3g')
model.score(X_test,y_test)

In [None]:
print(classification_report(y_test, y_pred))  # scikit-learn expects (y_true, y_pred)

In [None]:
roc_plot(knn_model, X, y, 2)

### Support Vector Classifier

In [None]:
from sklearn.svm import SVC

In [None]:
svm_clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm_clf.score(X_test, y_test)

In [None]:
y_pred=svm_clf.predict(X_test)
confusion_matrix(y_test,y_pred)
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True,fmt='.3g')

In [None]:
print(classification_report(y_test, y_pred))  # scikit-learn expects (y_true, y_pred)

In [None]:
roc_plot(svm_clf, X, y, 1)

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
nbg_model = GaussianNB()
nbg_model.fit(X_train, y_train)
nbg_model.score(X_test, y_test)

In [None]:
y_pred = nbg_model.predict(X_test)
confusion_matrix(y_test,y_pred)
sns.heatmap(confusion_matrix(y_test,y_pred), annot=True,fmt='.3g')

In [None]:
print(classification_report(y_test, y_pred))  # scikit-learn expects (y_true, y_pred)

In [None]:
roc_plot(nbg_model, X, y,2)