# Tips Prediction - Pre-Processing and Training Development <a id='Tips_Prediction_-_Pre-Processing_and_Training_Development'></a>

## 1. Contents <a id='1._Contents'></a>
* [Tips Prediction - Pre-Processing and Training Development](#Tips_Prediction_-_Pre-Processing_and_Training_Development)
  * [1. Contents](#1._Contents)
  * [2. Sourcing and Loading](#2._Sourcing_Loading)
    * [2a. Import relevant libraries](#2a._Import_relevant_libraries)
    * [2b. Load previously wrangled DataFrame](#2b._Load_previously_wrangled_DataFrame)
  * [3. Encoding Categorical Variables](#3._Encoding_Categorical_Variables)
  * [4. Save DataFrame](#4._Save_DataFrame)

## 2. Sourcing and Loading <a id='2._Sourcing_Loading'></a>

### 2a. Import relevant libraries<a id='2a._Import_relevant_libraries'></a>

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import kaggle as kg
import missingno as msno
import statsmodels.api as sm
import scipy.stats
from matplotlib.lines import Line2D
from kaggle.api.kaggle_api_extended import KaggleApi
from statsmodels.graphics.api import abline_plot
from sklearn.metrics import (
    mean_squared_error, r2_score, classification_report, confusion_matrix, accuracy_score,
    roc_curve, roc_auc_score, precision_recall_curve, auc, mean_absolute_error,
)  # plot_roc_curve omitted: it was removed in scikit-learn 1.2 (see RocCurveDisplay)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn import linear_model, preprocessing
from sklearn.feature_selection import SelectKBest, f_regression
from zipfile import ZipFile
from scipy import stats
from scipy.stats import t
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu
from numpy.random import seed
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

### 2b. Load previously wrangled DataFrame<a id='2b._Load_previously_wrangled_DataFrame'></a>

In [2]:
tips = pd.read_csv('tips.csv', index_col=0)
tips.head()

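|   | total_bill | tip | sex | smoker | day | time | size | perc |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 5.944673 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 16.054159 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 16.658734 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 13.978041 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 14.680765 |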
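
The `perc` column is not defined in this notebook. Judging from the rows shown, it appears to be the tip as a percentage of the bill (for row 0, 1.01 / 16.99 is about 5.94%). The sketch below assumes that definition and checks it against the five displayed rows; the names `sample`, `recomputed`, and `matches` are illustrative only.

```python
import numpy as np
import pandas as pd

# The five rows shown by tips.head(), copied from the output above.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "perc": [5.944673, 16.054159, 16.658734, 13.978041, 14.680765],
})

# Assumption: perc = 100 * tip / total_bill.
recomputed = 100 * sample["tip"] / sample["total_bill"]

# The displayed values are rounded to six decimals, so compare with a tolerance.
matches = np.allclose(recomputed, sample["perc"], atol=1e-6)
print(matches)
```

If the check fails on the full data, the assumed definition of `perc` is wrong and should be confirmed in the earlier wrangling notebook.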


## 3. Encoding Categorical Variables<a id='3._Encoding_Categorical_Variables'></a>

To model the data, we first need to encode the categorical variables (`sex`, `smoker`, `day`, `time`) as numeric indicator columns.

In [3]:
# Collect the names of the string-typed (object) columns to encode.
dummy_cols = tips.select_dtypes(include='object').columns.tolist()
dummy_cols

['sex', 'smoker', 'day', 'time']

In [4]:
tips_dummy = pd.get_dummies(tips[dummy_cols], columns=dummy_cols, drop_first=True)
tips_dummy.head()

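|   | sex_Male | smoker_Yes | day_Sat | day_Sun | day_Thur | time_Lunch |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 |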
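
With `drop_first=True`, `pd.get_dummies` drops the first level of each variable in sorted order, so a variable with k levels yields k-1 columns. That is why the output has no `sex_Female`, `smoker_No`, or `time_Dinner` column, and why one day is missing (in the standard tips dataset that would be `Fri`). A minimal sketch on a toy frame with illustrative values, not the full tips data:

```python
import pandas as pd

# Toy frame with illustrative values; not the full tips data.
toy = pd.DataFrame({
    "day": ["Sun", "Fri", "Sat", "Thur"],
    "time": ["Dinner", "Dinner", "Dinner", "Lunch"],
})

# dtype=int keeps 0/1 output on pandas versions that default to bool.
full = pd.get_dummies(toy, columns=["day", "time"], dtype=int)
reduced = pd.get_dummies(toy, columns=["day", "time"], drop_first=True, dtype=int)

print(list(full.columns))     # one column per level
print(list(reduced.columns))  # first level of each variable dropped
```

A row whose dummy columns are all zero belongs to the dropped baseline level. Dropping one level per variable avoids perfectly collinear columns when a linear model includes an intercept.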


In [5]:
# Attach the dummy columns by index, then drop the original categorical columns.
df = tips.merge(tips_dummy, left_index=True, right_index=True)
df = df.drop(columns=dummy_cols)
df.head()

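|   | total_bill | tip | size | perc | sex_Male | smoker_Yes | day_Sat | day_Sun | day_Thur | time_Lunch |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 5.944673 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 10.34 | 1.66 | 3 | 16.054159 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 21.01 | 3.5 | 3 | 16.658734 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 23.68 | 3.31 | 2 | 13.978041 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 24.59 | 3.61 | 4 | 14.680765 | 0 | 0 | 0 | 1 | 0 | 0 |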
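
The encode, merge, and drop steps above can likely be collapsed into one call, because `pd.get_dummies` accepts the whole frame and replaces only the columns named in `columns`. The sketch below compares both routes on a toy stand-in for `tips`; the frame and the names `cat_cols`, `two_step`, and `one_step` are illustrative.

```python
import pandas as pd

# Toy stand-in for `tips`; values are illustrative.
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "sex": ["Female", "Male", "Male"],
    "time": ["Dinner", "Lunch", "Dinner"],
})
cat_cols = ["sex", "time"]

# Two-step route, as in the notebook: encode, merge on the index, drop originals.
dummies = pd.get_dummies(toy[cat_cols], columns=cat_cols, drop_first=True, dtype=int)
two_step = toy.merge(dummies, left_index=True, right_index=True).drop(columns=cat_cols)

# One-step alternative: encode the listed columns within the full frame.
one_step = pd.get_dummies(toy, columns=cat_cols, drop_first=True, dtype=int)

same_columns = list(two_step.columns) == list(one_step.columns)
same_values = two_step.to_dict("list") == one_step.to_dict("list")
print(same_columns, same_values)
```

The one-step form leaves no intermediate frame to keep in sync, while the two-step form keeps `tips_dummy` available for inspection.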


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   size        244 non-null    int64  
 3   perc        244 non-null    float64
 4   sex_Male    244 non-null    uint8  
 5   smoker_Yes  244 non-null    uint8  
 6   day_Sat     244 non-null    uint8  
 7   day_Sun     244 non-null    uint8  
 8   day_Thur    244 non-null    uint8  
 9   time_Lunch  244 non-null    uint8  
dtypes: float64(3), int64(1), uint8(6)
memory usage: 21.0 KB
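
The `df.info()` output shows that every column is numeric and has 244 non-null values. The same check can be scripted so that it fails loudly if a later change reintroduces a string column or missing values. `check_model_ready` below is a hypothetical helper, not part of the notebook, and the two small frames are illustrative.

```python
import pandas as pd

def check_model_ready(frame):
    """Hypothetical helper: list non-numeric columns and columns with missing values."""
    non_numeric = frame.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    with_nulls = frame.columns[frame.isna().any()].tolist()
    return non_numeric, with_nulls

# Illustrative frames: one already encoded, one still raw with a missing value.
encoded = pd.DataFrame({"total_bill": [16.99, 10.34], "sex_Male": [0, 1]})
raw = pd.DataFrame({"total_bill": [16.99, None], "sex": ["Female", "Male"]})

print(check_model_ready(encoded))
print(check_model_ready(raw))
```

Two empty lists mean the frame can be passed to a scikit-learn estimator without further encoding or imputation.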


## 4. Save DataFrame<a id='4._Save_DataFrame'></a>

In [7]:
df.to_csv('df.csv')
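
`to_csv` writes the index as an unnamed first column by default, so the next notebook should read the file with `index_col=0`, as was done for `tips.csv` above. CSV does not store dtypes, so the `uint8` dummy columns are inferred as `int64` on reload. A small round-trip sketch, using an in-memory buffer and an illustrative frame instead of `df.csv`:

```python
import io
import pandas as pd

# Illustrative frame with one uint8 dummy column, mirroring the dtypes in df.info().
encoded = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "sex_Male": pd.Series([0, 1], dtype="uint8"),
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
encoded.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored.columns.tolist())
print(restored["sex_Male"].dtype)  # the uint8 dtype is not preserved by CSV
```

If the exact dtypes matter downstream, pass `dtype=` to `read_csv` or cast after loading.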