<h1 style="background-color:SlateBlue; color:white;padding:15px"> Outline </h1>

* <a href="#Package"> Package imports </a>
* <a href="#quality"> Quick quality check </a>
* <a href="#pipline"> Building a pipeline function </a>
* <a href="#Comparing"> Comparing the performance of each model </a>

<p style="background-color:SlateBlue; color:white;padding:15px" id="Package"> Package imports </p>

In [None]:
#Data Wrangling
import numpy as np
import pandas as pd

#Data Visualization
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()

#Model selection
from sklearn.preprocessing   import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model    import LinearRegression
from sklearn.neighbors       import KNeighborsRegressor
from sklearn.tree            import DecisionTreeRegressor
from sklearn.ensemble        import RandomForestRegressor
from sklearn.ensemble        import GradientBoostingRegressor

#Model evaluation
from sklearn.metrics         import mean_squared_error

<p style="background-color:SlateBlue; color:white;padding:15px" id='quality'> Quick quality check </p>

In [None]:
path = '../input/pizza-price-prediction/pizza_v1.csv'
df = pd.read_csv(path)
df.head()

In [None]:
df.info()

In [None]:
# Clean the price_rupiah column: strip the 'Rp' prefix and the thousands separators
df['price_rupiah'] = df['price_rupiah'].apply(lambda x: x.replace('Rp', '').replace(',', ''))

In [None]:
df.head()

In [None]:
# Convert the column to float
df["price_rupiah"] = df["price_rupiah"].astype(float)

In [None]:
df.info()
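
The two cleaning steps above (stripping `Rp` and the commas, then casting to float) can also be written with pandas' vectorized string methods. This is a minimal sketch, assuming the raw values look like `Rp235,000`. The sample values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical sample values in the assumed "Rp235,000" style
prices = pd.Series(["Rp235,000", "Rp198,000", "Rp72,000"])

# Drop every non-digit character, then convert to a float column.
# errors="coerce" turns anything that still fails to parse into NaN
# instead of raising, which makes bad rows easy to spot afterwards.
digits_only = prices.str.replace(r"[^\d]", "", regex=True)
cleaned = pd.to_numeric(digits_only, errors="coerce").astype(float)

print(cleaned.tolist())
```

Unlike `apply` with a lambda, the `.str` accessor skips missing values rather than failing on them.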

In [None]:
# Print the value counts for every column of dtype object
def value_counts(data):
    for i in data.columns:
        if data[i].dtypes == 'object':
            print(data[i].value_counts())
            print()
            print('The number of unique values is : {}'.format(data[i].nunique()))
            print()
            print('-----------------------------------------------------------')
            print()

In [None]:
value_counts(df)

In [None]:
plt.figure(figsize = (10,6))
sns.swarmplot(x = 'size', y = 'price_rupiah', data = df)
plt.show()

In [None]:
sauce_cheese_map = {
    'no'  : 0,
    'yes' : 1
}
size_map = {
    'small'   : 0,
    'medium'  : 1,
    'large'   : 2,
    'XL'      : 3,
    'reguler' : 4,
    'jumbo'   : 5 
}

In [None]:
df['extra_sauce'] = df['extra_sauce'].map(sauce_cheese_map)
df['extra_sauce'] = df['extra_sauce'].astype(float)

df['extra_cheese'] = df['extra_cheese'].map(sauce_cheese_map)
df['extra_cheese'] = df['extra_cheese'].astype(float)

df['size'] = df['size'].map(size_map)
df['size'] = df['size'].astype(float)

In [None]:
# Label-encode the remaining object columns
labencoder = LabelEncoder()

def label_encoder(data):
    for col in data.columns:
        if data[col].dtype == 'object':
            # Refit the encoder on each column before transforming it
            data[col] = labencoder.fit_transform(data[col].values)

    # Convert integer columns to float once, after all columns are encoded
    for col in data.columns:
        if pd.api.types.is_integer_dtype(data[col]):
            data[col] = data[col].astype(float)

In [None]:
label_encoder(df)

In [None]:
df.info()

In [None]:
df.head()

In [None]:
# Compute the correlation matrix to gauge the strength and direction of the linear relationships
correlation = df.corr()

# Plot a heatmap of the correlation matrix
plt.figure(figsize=(15, 8))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f', annot=True, annot_kws={'size': 8}, cmap='YlGnBu')
plt.show()

> The correlation between **price** and **diameter** is **0.8**

> The correlation between **price** and **size** is **0.8**

> The correlation between **price** and **company** is **-0.3**
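
Reading individual cells off a heatmap gets tedious as the number of columns grows. One option is to pull out only the target's column of the correlation matrix and rank it by absolute value. This is a minimal sketch on a made-up toy frame: the column names mirror the notebook, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook
toy = pd.DataFrame({
    "diameter":     [8, 12, 14, 16, 18, 22],
    "company":      [0, 1, 2, 3, 4, 0],
    "price_rupiah": [70, 110, 140, 150, 190, 230],
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["price_rupiah"].drop("price_rupiah")

# Rank by strength (absolute value) while keeping the sign visible
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)

print(ranked.round(2))
```

With the real data, replacing `toy` with `df` should give the same numbers as the `price_rupiah` row of the heatmap.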

<p style="background-color:SlateBlue; color:white;padding:15px" id='pipline'> Building a pipeline function </p>

In [None]:
X = df.drop(['price_rupiah'], axis = 1)
y = df['price_rupiah']

In [None]:
X.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 123,
                                                    shuffle=True)

In [None]:
# Fit a model, then report its train/test scores and its test RMSE
def run_model(model, X_train, X_test, y_train, y_test):

    # Fit the model
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # For regressors, score() returns R² (coefficient of determination),
    # not a classification accuracy
    train_accuracy = model.score(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    print(f'Training score (R²) is : {train_accuracy}')
    print()
    print(f'Test score (R²)     is : {test_accuracy}')
    print()
    print(f'RMSE                is : {rmse}')
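
For scikit-learn regressors, `model.score` returns the R² coefficient of determination, so the "accuracy" figures quoted in this notebook are R² values. The sketch below shows this by comparing `score` with `r2_score`. It also returns the metrics rather than printing them, which makes them easier to reuse. The function name `evaluate_model` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model and return its metrics as a dict (hypothetical helper)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    }


# Synthetic stand-in for the pizza data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

demo_model = LinearRegression()
metrics = evaluate_model(demo_model, Xtr, Xte, ytr, yte)

# score() on a regressor and r2_score() compute the same quantity
check_r2 = r2_score(yte, demo_model.predict(Xte))
print(metrics, check_r2)
```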

<p style="background-color:SlateBlue; color:white;padding:15px" id='Comparing'> Comparing the performance of each model </p>

**1. Linear Regression**

In [None]:
lr = LinearRegression()
run_model(lr, X_train, X_test, y_train, y_test)

> Linear regression performs **poorly**: its test score (R²) is only **74.8%**


**2. KNN Regressor**

In [None]:
k_values = [1,5,10]

for n in k_values:
    model = KNeighborsRegressor(n_neighbors=n)
    run_model(model, X_train, X_test, y_train, y_test)
    print()
    print('The Number of neighbors is : {}'.format(n))
    print()
    print('--------------------------------')
    print()

> The best score is obtained with **5** neighbors

> The test score (R²) is **85%**
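
Picking the number of neighbors by its score on the test split means the test set is also being used for tuning. A common alternative is to choose `k` by cross-validation on the training data only. This is a minimal sketch on synthetic stand-in data, assuming the same candidate values as above. With the real data, `X_train` and `y_train` would replace the synthetic arrays.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the pizza dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=15.0, random_state=0)

cv_scores = {}
for k in [1, 5, 10]:
    # For a regressor the default scoring is R²; cv=5 uses five folds
    fold_scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_scores[k] = fold_scores.mean()

# Keep the k with the highest mean cross-validated R²
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```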

**3. Decision Tree Regressor**

In [None]:
model = DecisionTreeRegressor()
run_model(model, X_train, X_test, y_train, y_test)

> The **training** score is **100%**, while the **test** score is **81.3%**

> The gap between the two indicates that this model is **overfitting**
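
A perfect training score is typical of a fully grown decision tree, which can keep splitting until each leaf holds a single training sample. One common way to limit this is to cap the tree depth with `max_depth`. The sketch below compares an unconstrained tree with two shallower ones on synthetic stand-in data. The depth values are arbitrary, and it is not guaranteed that a shallower tree scores better on the pizza data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the pizza dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

scores = {}
for depth in [None, 3, 5]:
    # max_depth=None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(Xtr, ytr)
    scores[depth] = (tree.score(Xtr, ytr), tree.score(Xte, yte))

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")
```

The quantity to watch is the gap between the train and test scores: a smaller gap suggests less overfitting.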

**4. Random Forest Regressor**

In [None]:
trees = [10,50,100,200, 500]
for n in trees:
    model = RandomForestRegressor(n_estimators=n)
    run_model(model, X_train, X_test, y_train, y_test)
    print()
    print('The Number of estimators is : {}'.format(n))
    print()
    print('--------------------------------')
    print()

> The best score is obtained with **200** estimators

> The test score (R²) is **88.7%**

**5. Gradient Boosting**

In [None]:
model = GradientBoostingRegressor()
run_model(model, X_train, X_test, y_train, y_test)

> Gradient boosting is the best model, with a test score (R²) of **90.6%**
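
The scores above are spread across several cells, so a single table can make the comparison easier to read. This is a minimal sketch that fits a few of the same model types on synthetic stand-in data and sorts them by test R². The `random_state` values and the data are assumptions for illustration, and the ranking on synthetic data says nothing about the ranking on the pizza data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the pizza dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# Fixing random_state makes the tree ensembles reproducible between runs
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(Xtr, ytr)
    rows.append({
        "model": name,
        "train_r2": model.score(Xtr, ytr),
        "test_r2": model.score(Xte, yte),
        "rmse": float(np.sqrt(mean_squared_error(yte, model.predict(Xte)))),
    })

# One row per model, best test R² first
results = pd.DataFrame(rows).sort_values("test_r2", ascending=False).reset_index(drop=True)
print(results)
```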