> # **Import Libraries**

Import the modules used in this analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

> ### **Setup**

Set up the libraries.

In [None]:
warnings.simplefilter('ignore')

%matplotlib inline
%reload_ext autoreload
%autoreload 2

sns.set_style('whitegrid')
sns.set_context('paper', font_scale=1.5)

plt.style.use('fivethirtyeight')

pd.set_option('display.width', 100)
pd.set_option('display.max_rows', 25)
pd.set_option('display.max_columns', 25)

> **Note: sorry if my English is bad; hopefully you can still understand.**

> # **Load Dataset**

I built this dataset myself by collecting pizza prices and related details from the internet. I hope you find it useful. Apologies in advance for any spelling mistakes.

In [None]:
pizza_data = pd.read_csv('../input/pizza-price-prediction/pizza_v2.csv') # Load dataset

> # **Data Preprocessing**

Always inspect the data before building a model. If the dataset turns out to be dirty, it has to be cleaned first, whether you like it or not. Because I built this dataset myself, I can confirm that it is clean and has no null values, so we can focus on converting the price and diameter strings to numbers and on encoding the categorical columns.

In [None]:
pizza_data # Checking the overall data

In [None]:
pizza_data.info() # Getting the information about the data

In [None]:
pizza_data.dtypes.to_frame() # Checking the type
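
The "no null values" claim is easy to verify rather than take on trust. Below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not taken from the dataset); the same two calls should work on `pizza_data`.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook uses pizza_data from pizza_v2.csv
toy = pd.DataFrame({
    'company': ['A', 'A', 'B', None],
    'price_rupiah': ['Rp235,000', 'Rp198,000', None, 'Rp120,000'],
    'diameter': ['22 inch', '20 inch', '16 inch', '12 inch'],
})

null_counts = toy.isnull().sum()         # Missing values per column
duplicate_rows = toy.duplicated().sum()  # Number of fully duplicated rows

print(null_counts)
print(f'Duplicated rows: {duplicate_rows}')
```

On a clean dataset, every entry of `null_counts` would be zero.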

In [None]:
pizza_data['price_rupiah'] = pizza_data['price_rupiah'].str.replace('Rp', '').str.replace(',', '').astype('float64') # Removing Rp 
pizza_data['diameter'] = pizza_data['diameter'].str.replace('inch', '').str.replace(',', '').astype('float64') # Removing Inch

pizza_data.loc[:, ['price_rupiah', 'diameter']] # Checking
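
An alternative to chaining several `str.replace` calls is to strip everything that is not a digit or a decimal point in one pass. This is only a sketch on hypothetical sample strings, and it assumes that commas are thousands separators (as in `Rp235,000`); it would give wrong results if a dot were used for thousands.

```python
import pandas as pd

# Hypothetical sample values in the same shape as the raw columns
raw = pd.DataFrame({
    'price_rupiah': ['Rp235,000', 'Rp198,000', 'Rp72,000'],
    'diameter': ['22 inch', '20 inch', '8.5 inch'],
})

# Drop every character that is not a digit or a decimal point, then convert
price = pd.to_numeric(
    raw['price_rupiah'].str.replace(r'[^\d.]', '', regex=True)
).astype('float64')
diameter = pd.to_numeric(
    raw['diameter'].str.replace(r'[^\d.]', '', regex=True)
).astype('float64')

print(price.tolist())
print(diameter.tolist())
```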

In [None]:
pizza_data.dtypes.to_frame() # Checking

In [None]:
pizza_data.T # Checking the structure of our data

> # **Data Visualization**

Data visualization is the graphical representation of data. It is an especially efficient way to communicate when there are too many values to read one by one, as in a time series.

> ### **1. Extra Sauce Boxplot Visualization**

In [None]:
plt.figure(figsize=(16, 5))
sns.boxplot(x='company', 
                y='price_rupiah', 
                data=pizza_data, 
                hue='extra_sauce')
plt.title('Boxplot of Price by Company, With and Without Extra Sauce')
plt.ylim(0, 250000)
plt.show()

> ### **2. Extra Cheese Boxplot Visualization**

In [None]:
plt.figure(figsize=(16, 5))
sns.boxplot(x='company', 
                y='price_rupiah', 
                data=pizza_data, 
                hue='extra_cheese')
plt.title('Boxplot of Price by Company, With and Without Extra Cheese')
plt.ylim(0, 250000)
plt.show()

> ### **3. Extra Mushrooms Boxplot Visualization**

In [None]:
plt.figure(figsize=(16, 5))
sns.boxplot(x='company', 
                y='price_rupiah', 
                data=pizza_data, 
                hue='extra_mushrooms')
plt.title('Boxplot of Price by Company, With and Without Extra Mushrooms')
plt.ylim(0, 250000)
plt.show()

> ### **4. Pizza Variant Boxplot Visualization**

In [None]:
plt.figure(figsize=(16, 16))
sns.boxplot(x='company', 
                y='price_rupiah', 
                data=pizza_data, 
                hue='variant')
plt.title('Boxplot Visualization Pizza Variant')
plt.ylim(0, 250000)
plt.show()

> ### **5. Pizza Topping Boxplot Visualization**

In [None]:
plt.figure(figsize=(16, 12))
sns.boxplot(x='company', 
                y='price_rupiah', 
                data=pizza_data, 
                hue='topping')
plt.title('Boxplot Visualization Pizza Topping')
plt.ylim(0, 250000)
plt.show()

> ### **6. Pizza Size Boxplot Visualization**

In [None]:
plt.figure(figsize=(16, 12))
sns.boxplot(x='company', 
                y='price_rupiah', 
                data=pizza_data, 
                hue='size')
plt.title('Boxplot Visualization Pizza Size')
plt.ylim(0, 250000)
plt.show()

> ###  **Conclusion :**

As the boxplots above show, the data contains some outliers.
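
The points drawn beyond the whiskers can also be listed numerically. By default, a boxplot places its whiskers at 1.5 times the interquartile range (IQR), so the same rule can be applied by hand. The prices below are hypothetical; on the real data the Series would be `pizza_data['price_rupiah']`, possibly grouped by company.

```python
import pandas as pd

# Hypothetical prices; the real notebook would use pizza_data['price_rupiah']
prices = pd.Series(
    [70000, 75000, 80000, 85000, 90000, 95000, 100000, 248000],
    dtype='float64',
)

q1 = prices.quantile(0.25)  # First quartile
q3 = prices.quantile(0.75)  # Third quartile
iqr = q3 - q1               # Interquartile range

lower = q1 - 1.5 * iqr      # Anything below this is flagged
upper = q3 + 1.5 * iqr      # Anything above this is flagged
outliers = prices[(prices < lower) | (prices > upper)]

print(f'Bounds: {lower} to {upper}')
print(outliers)
```

Whether to drop, cap, or keep such rows is a modeling decision; this notebook keeps them.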

> ### **1. Company with their Price Visualization**

In [None]:
plt.figure(figsize=(16, 5))
sns.swarmplot(x='company', 
                    y='price_rupiah', 
                    data=pizza_data)
plt.title('Company with their Price')
plt.ylim(0, 260000)
plt.show()

> ### **2. Company with their Diameter Visualization**

In [None]:
plt.figure(figsize=(16, 5))
sns.swarmplot(x='company', 
                    y='diameter', 
                    data=pizza_data)
plt.title('Company with their Diameter')
plt.ylim(0, 30)
plt.show()

> ### **3. Every Company with Price and Diameter Visualization**

In [None]:
pizza_fg = sns.FacetGrid(pizza_data, col='company', col_wrap=5, height=5)
pizza_fg.map(plt.scatter, 'price_rupiah', 'diameter', marker='.') # Scatter avoids joining unordered rows with lines
plt.show()

> ### **Do the visualizations above give you a clear picture of the data?**

If not, feel free to explore it with your own visualizations.

> # **Encoding**

We need to encode the data because most of its columns are categorical. I use LabelEncoder here to turn each category into an integer code.

In [None]:
from sklearn.preprocessing import LabelEncoder # Import the Encoder

encoder = LabelEncoder() # Create the encoder

In [None]:
for col in pizza_data.columns: # Loop over every column
    if pizza_data[col].dtype == 'object': # Only encode string (categorical) columns
        pizza_data[col] = encoder.fit_transform(pizza_data[col]) # Fit and transform in one step

for col in pizza_data.columns: # Second pass, run once after encoding
    if pd.api.types.is_integer_dtype(pizza_data[col]): # Works for int32 and int64 alike
        pizza_data[col] = pizza_data[col].astype('float64') # Change the type

In [None]:
pizza_data.head() # Checking the first 5 rows of data
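
One thing to keep in mind with a single shared `LabelEncoder`: after the loop it only remembers the mapping of the last column it was fitted on. If the codes ever need to be turned back into labels, one option is to keep one encoder per column. The sketch below uses a hypothetical toy frame, and the `encoders` dictionary is an illustrative name, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns in the spirit of the pizza data
toy = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'large'],
    'extra_sauce': ['yes', 'no', 'yes', 'yes'],
})

encoders = {}  # One fitted encoder per column, so each mapping stays available
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
print(list(encoders['size'].classes_))                   # Codes follow sorted label order
print(list(encoders['size'].inverse_transform([0, 2])))  # Map codes back to labels
```

Note that the codes follow alphabetical order (`large` = 0, `medium` = 1, `small` = 2), not the natural size order. scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is its counterpart for feature columns.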

> # **Splitting, Modeling, Model Evaluation**

In [None]:
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

> ### **Split Data**

Separate the features from the target, then split both into training and test sets with `train_test_split` from sklearn.

In [None]:
X = pizza_data.drop(columns=['price_rupiah']) # Data X
y = pizza_data['price_rupiah'] # Data y

In [None]:
trainX, testX, trainY, testY = train_test_split(X, y,
                                                              test_size=0.3,
                                                              random_state=42) # Split it into train and test data
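
To make the effect of `test_size=0.3` concrete, here is a small sketch on hypothetical arrays (`X_toy` and `y_toy` are made up). With 10 rows, 3 go to the test set and 7 to the training set, and each feature row stays paired with its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target with 10 rows; row i is [2i, 2i + 1]
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)          # Sizes of the two splits
print((X_tr[:, 0] == 2 * y_tr).all())  # Rows and targets are still aligned
```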

> ### **Modeling**

Since the target is a numeric price, this is a regression problem, so I use **XGBRegressor**.

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

In [None]:
pipe = Pipeline([ # Our Pipeline
    ('scaler', StandardScaler()),
    ('transformer', QuantileTransformer()),
    ('model', XGBRegressor(learning_rate=0.09,
                                       n_estimators=1200,
                                       objective='reg:squarederror',
                                       booster='gbtree'))
])

pipe.fit(trainX, trainY) # Train Data
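
A practical benefit of bundling the steps is that `fit` learns the preprocessing statistics from the training rows only, which avoids leaking information from the test set. The sketch below illustrates this on hypothetical synthetic data. `GradientBoostingRegressor` is swapped in for `XGBRegressor` purely to keep the example free of extra dependencies, so this is not the model used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: price grows with diameter, plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(8, 22, size=(200, 1))
y_toy = 10000 * X_toy[:, 0] + rng.normal(0, 5000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

toy_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingRegressor(random_state=0)),  # Stand-in for XGBRegressor
])
toy_pipe.fit(X_tr, y_tr)  # Fits the scaler on the training rows only, then the model

# The scaler's mean comes from the training split, not from the full data
print(toy_pipe.named_steps['scaler'].mean_, X_tr.mean(axis=0))
print(toy_pipe.score(X_te, y_te))  # R^2 on the held-out rows
```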

> ### **Model Evaluation**

In [None]:
from sklearn import metrics
import math

In [None]:
pred_train = pipe.predict(trainX) # Predict Train Data
pred_test = pipe.predict(testX) # Predict Test Data

> ### **Evaluate Train Data**

In [None]:
train_r2_score = metrics.r2_score(trainY, pred_train) # R2_score
print(f'Train R2_score: {train_r2_score}')

train_mse = metrics.mean_squared_error(trainY, pred_train) # MSE Score
print(f'Train MSE : {train_mse}')

train_RMSE = math.sqrt(metrics.mean_squared_error(trainY, pred_train)) # SQRT MSE Score
print(f'Train RMSE : {train_RMSE}')
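
For readers who want to see what these three numbers mean, the sketch below computes them by hand on hypothetical prices and checks the results against `sklearn.metrics`. MSE is the mean of the squared errors, RMSE is its square root (so it is in rupiah again), and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted prices; every prediction is off by 10,000
y_true = np.array([100000.0, 150000.0, 200000.0, 250000.0])
y_pred = np.array([110000.0, 140000.0, 190000.0, 260000.0])

mse = np.mean((y_true - y_pred) ** 2)           # Mean squared error
rmse = np.sqrt(mse)                             # Back in the target's units
ss_res = np.sum((y_true - y_pred) ** 2)         # Residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # Total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
print(metrics.mean_squared_error(y_true, y_pred), metrics.r2_score(y_true, y_pred))
```

Here the RMSE is 10,000, which matches the size of each individual error.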

> ### **Visualization for Actual and Predicted Price in Training Data**

In [None]:
train_plot = pd.DataFrame({'Actual Price': trainY, 'Predicted Price': pred_train}) # Column order matches the legend
train_plot = train_plot.reset_index(drop=True) # Drop the shuffled index left by the split
fig = plt.figure(figsize=(16, 9))
plt.plot(train_plot[:50]) # Plot the first 50 rows
plt.ylim(0, 260000)
plt.legend(['Actual Price', 'Predicted Price'])
plt.title('Actual & Predicted Price (Training Data)')
plt.show()

> ### **Evaluate Test Data**

In [None]:
test_r2_score = metrics.r2_score(testY, pred_test) # R2_score
print(f'Test R2_score: {test_r2_score}')

test_mse = metrics.mean_squared_error(testY, pred_test) # MSE Score
print(f'Test MSE : {test_mse}')

test_RMSE = math.sqrt(metrics.mean_squared_error(testY, pred_test)) # SQRT MSE Score
print(f'Test RMSE : {test_RMSE}')

> ### **Visualization for Actual and Predicted Price in Testing Data**

In [None]:
test_plot = pd.DataFrame({'Actual Price': testY, 'Predicted Price': pred_test}) # Column order matches the legend
test_plot = test_plot.reset_index(drop=True) # Drop the shuffled index left by the split
fig = plt.figure(figsize=(16, 9))
plt.plot(test_plot[:50]) # Plot the first 50 rows
plt.ylim(0, 260000)
plt.legend(['Actual Price', 'Predicted Price'])
plt.title('Actual & Predicted Price (Testing Data)')
plt.show()

> # **Prediction**

Write the actual and predicted prices to CSV files.

In [None]:
train_output = pd.DataFrame({
    'Train Actual Price': trainY,
    'Train Predicted Price': pred_train}) # No trailing space in the column name

train_output.to_csv('Train Prediction.csv', index=False)
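
The `index=False` argument matters here: after `train_test_split` the rows carry a shuffled index, and without this argument that index would be written out as an extra unnamed column. The sketch below shows the round trip on hypothetical values, using an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical actual and predicted prices with a shuffled index,
# similar to what train_test_split leaves behind
toy_output = pd.DataFrame({
    'Actual Price': [100000.0, 150000.0],
    'Predicted Price': [110000.0, 140000.0],
}, index=[57, 3])

buffer = io.StringIO()                  # In-memory stand-in for a CSV file
toy_output.to_csv(buffer, index=False)  # Leave the shuffled index out
buffer.seek(0)

restored = pd.read_csv(buffer)          # Comes back with a fresh 0..n-1 index
print(restored)
```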

> ## **Train Output Prediction**

In [None]:
train_prediction_output = pd.read_csv('./Train Prediction.csv')
train_prediction_output.head(10)

In [None]:
test_output = pd.DataFrame({
    'Test Actual Price': testY,
    'Test Predicted Price': pred_test}) # No trailing space in the column name

test_output.to_csv('Test Prediction.csv', index=False)

> ## **Test Output Prediction**

In [None]:
test_prediction_output = pd.read_csv('./Test Prediction.csv')
test_prediction_output.head(10)

> # **That's it! Don't forget to leave feedback and an upvote if you like it. Thanks in advance!**

## **Here are some other notebooks I made:**

**Data Analysis and Visualization:**

- [World Covid Vaccination](https://www.kaggle.com/knightbearr/data-visualization-world-vaccination-knightbearr)
- [Netflix Time Series Visualization](https://www.kaggle.com/knightbearr/netflix-visualization-time-series-knightbearr)
- [Taiwan Weight Stock Analysist](https://www.kaggle.com/knightbearr/taiwan-weight-stock-index-analysis-knightbearr)

**Regression and Classification:**

- [S&P 500 Companies](https://www.kaggle.com/knightbearr/pricesales-eda-rfr-knightbearr)
- [Credit Card Fraud Detection](https://www.kaggle.com/knightbearr/credit-card-fraud-detection-knightbearr)
- [Car Price V3](https://www.kaggle.com/knightbearr/car-price-v3-xgbregressor-knightbearr)
- [House Price Iran](https://www.kaggle.com/knightbearr/house-price-iran-knightbearr)

**Deep Learning:**

- [Rock Paper Scissors](https://www.kaggle.com/knightbearr/rock-paper-scissors-knightbearr)

**Some Python Code:**

- [Python Cheat Sheet](https://www.kaggle.com/knightbearr/python-cheat-sheet-knightbearr)
- [22 Python Progam](https://www.kaggle.com/knightbearr/22-simple-python-program-knightbearr)