# AICP Internship Task

Tips for waiters depend on many factors, such as the type of restaurant, the number of
people in the party, and the size of the bill. Waiter tips analysis is a popular data
science case study in which the goal is to predict the tip given to a waiter for
serving food in a restaurant.


Find the Dataset **“tips.csv”**.

The food server of a restaurant recorded data about the tips given to the waiters for serving the
food. The data recorded by the food server is as follows:


1. **total_bill**: Total bill in dollars including taxes
2. **tip**: Tip given to waiters in dollars
3. **sex**: gender of the person paying the bill
4. **smoker**: whether the person smoked or not
5. **day**: day of the week
6. **time**: lunch or dinner
7. **size**: number of people at the table

You can use the following libraries: **Numpy**, **Pandas**, **Plotly**, **sklearn**

## Q.1:

Import data and check null values, check column info and the descriptive statistics of the data.

In [142]:
# importing pandas for data manipulation
import pandas as pd

In [143]:
# loading the CSV file into a DataFrame
dataset = pd.read_csv("tips.csv")

In [144]:
# checking that the CSV file loaded correctly by displaying its first five rows
dataset.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [145]:
# checking for null values
null_values = dataset.isnull()
print(null_values)

     total_bill    tip    sex  smoker    day   time   size
0         False  False  False   False  False  False  False
1         False  False  False   False  False  False  False
2         False  False  False   False  False  False  False
3         False  False  False   False  False  False  False
4         False  False  False   False  False  False  False
..          ...    ...    ...     ...    ...    ...    ...
239       False  False  False   False  False  False  False
240       False  False  False   False  False  False  False
241       False  False  False   False  False  False  False
242       False  False  False   False  False  False  False
243       False  False  False   False  False  False  False

[244 rows x 7 columns]


In [146]:
# checking for null values and calculating which column has how many null values
null_values.sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64
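Because the tips data has no missing values, the check above only ever shows zeros. As a minimal sketch of what the same calls report when values are missing, the snippet below uses a hypothetical three-row `sample` frame (not part of tips.csv) with two gaps inserted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two deliberately missing values (not real tips.csv rows)
sample = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01],
    "tip": [1.01, 1.66, np.nan],
    "size": [2, 3, 3],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()

# Single yes/no answer: is anything missing anywhere in the frame?
has_nulls = sample.isnull().values.any()

print(null_counts)
print(has_nulls)
```

Chaining `isnull().sum()` gives the per-column counts in one step, so the intermediate boolean table does not need to be printed.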

In [147]:
# checking column info, i.e. data types, non-null counts and memory usage
# (info() prints its report itself and returns None, so it is not wrapped in print())
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [148]:
# calculating the statistics of the data provided
statistics = dataset.describe()
print(statistics)

       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000


## Q.2:

Have a look at the tips given to the waiters according to:
> - the total bill paid
> - number of people at a table
> - and the day of the week

In [149]:
# importing library to plot graphs
import plotly.express as px

In [150]:
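# scatter plot of tip against total bill; marker size shows party size, color shows day
# (trendline = "ols" needs the statsmodels package to be installed)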
figure1 = px.scatter(dataset, x = 'total_bill', y = 'tip', size = 'size', color = 'day', trendline = "ols")

In [151]:
figure1.show()        # displaying graph

## Q.3:

Have a look at the tips given to the waiters according to:
> - the total bill paid
> - the number of people at a table
> - and the gender of the person paying the bill

In [152]:
# generating graph
figure2 = px.scatter(dataset, x = 'total_bill', y = 'tip', size = 'size', color = 'sex', trendline = "ols")

In [153]:
figure2.show()          # displaying graph

## Q.4:

Have a look at the tips given to the waiters according to:
> - the total bill paid
> - the number of people at a table
> - and the time of the meal

In [154]:
# generating graph
figure3 = px.scatter(dataset, x = 'total_bill', y = 'tip', size = 'size', color = 'time', trendline = "ols")

In [155]:
figure3.show()  # displaying graph

## Q.5:

Now check the tips given to the waiters by day to find out on which day the most
tips are given:


In [156]:
# generating graph
figure4 = px.pie(dataset, values = 'tip', names = 'day', hole = 0.5)

In [157]:
figure4.show()       # displaying graph
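The pie chart shows each day's share of the total tips. To read off the exact totals, a grouped sum can be used alongside it. The sketch below runs on a small hypothetical `sample` frame so that it is self-contained; with the real data, `dataset` would take its place.

```python
import pandas as pd

# Hypothetical sample rows; the real analysis would use the full `dataset`
sample = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Sat", "Thur", "Fri"],
    "tip": [1.01, 1.66, 3.00, 2.00, 4.00, 3.50],
})

# Total tip amount per day, largest first
tips_by_day = sample.groupby("day")["tip"].sum().sort_values(ascending=False)

# Day with the highest total
top_day = tips_by_day.idxmax()

print(tips_by_day)
print(top_day)
```

The same pattern applies to the later questions by swapping `"day"` for `"sex"`, `"smoker"` or `"time"`.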

## Q.6:

Look at the total tips given to waiters by the gender of the person paying the bill to see who
tips waiters the most:

In [158]:
# generating graph
figure5 = px.pie(dataset, values = 'tip', names = 'sex', hole = 0.5)

In [159]:
figure5.show()     # displaying graph
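A pie chart built with `values = 'tip'` compares total tip amounts, so a group with more bills can look more generous even if its average tip is lower. As a rough sketch, the snippet below puts the count, sum and mean side by side, using only the five rows shown in the `head()` output above; the variable names are illustrative.

```python
import pandas as pd

# The five rows shown in the head() output above (sex and tip columns only)
sample = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Number of tips, total tip amount and average tip for each group
summary = sample.groupby("sex")["tip"].agg(["count", "sum", "mean"])

print(summary)
```

On the full data, the `mean` column is the fairer basis for saying who tips more per bill, while `sum` matches what the pie chart displays.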

## Q.7:

Now check the tips given to the waiters by day to find out on which day the most
tips are given:

In [160]:
# generating graph
figure6 = px.pie(dataset, values = 'tip', names = 'day', hole = 0.5)

In [161]:
figure6.show()       # displaying graph

## Q.8:

Let’s see whether smokers or non-smokers tip more:

In [162]:
# generating graph
figure7 = px.pie(dataset, values = 'tip', names = 'smoker', hole = 0.5)

In [163]:
figure7.show()                # displaying graph

## Q.9:

Now let’s see whether more tips are given during lunch or dinner:

In [164]:
# generating graph
figure8 = px.pie(dataset, values = 'tip', names = 'time', hole = 0.5)

In [165]:
figure8.show()             # displaying graph

## Q.10:

Before training a waiter tips prediction model, do some data transformation by
transforming the categorical values into numerical values:


In [171]:
dataset.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,0,0,3,0,2
1,10.34,1.66,1,0,3,0,3
2,21.01,3.5,1,0,3,0,3
3,23.68,3.31,1,0,3,0,2
4,24.59,3.61,0,0,3,0,4


In [167]:
# encoding column 'sex' as integers: Female -> 0, Male -> 1
# (assigning the result back avoids the chained inplace replace that newer pandas versions warn about)
dataset['sex'] = dataset['sex'].map({'Female': 0, 'Male': 1})

In [168]:
# encoding column 'smoker' as integers: No -> 0, Yes -> 1
dataset['smoker'] = dataset['smoker'].map({'No': 0, 'Yes': 1})

In [169]:
# encoding column 'day' as integers: Thur -> 0, Fri -> 1, Sat -> 2, Sun -> 3
dataset['day'] = dataset['day'].map({'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3})

In [170]:
# encoding column 'time' as integers: Dinner -> 0, Lunch -> 1
dataset['time'] = dataset['time'].map({'Dinner': 0, 'Lunch': 1})

In [172]:
dataset.tail()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
239,29.03,5.92,1,0,2,0,3
240,27.18,2.0,0,1,2,0,2
241,22.67,2.0,1,1,2,0,2
242,17.82,1.75,1,0,2,0,2
243,18.78,3.0,0,0,0,0,2
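Coding `day` as 0 to 3 tells a linear model that the days are evenly spaced on a scale, which is an assumption rather than a property of the data. One common alternative, sketched here on a hypothetical three-row frame and not used in the rest of this notebook, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical sample with the original string labels for 'day'
sample = pd.DataFrame({
    "total_bill": [16.99, 27.18, 18.78],
    "day": ["Sun", "Sat", "Thur"],
})

# One 0/1 indicator column per distinct day; other columns are left unchanged
encoded = pd.get_dummies(sample, columns=["day"], dtype=int)

print(encoded)
```

Each day gets its own coefficient this way, at the cost of more columns. The integer coding above is kept in the following cells so that the outputs match.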


## Q.11:

Now split the data into training and test sets. Then train a machine learning model (Linear
Regression) for the task of waiter tips prediction.

In [173]:
# creating x (input features): every column except 'tip'
x = dataset.iloc[:, dataset.columns != 'tip']

In [174]:
# data to be input
x.head()

Unnamed: 0,total_bill,sex,smoker,day,time,size
0,16.99,0,0,3,0,2
1,10.34,1,0,3,0,3
2,21.01,1,0,3,0,3
3,23.68,1,0,3,0,2
4,24.59,0,0,3,0,4


In [175]:
# creating y (output): the 'tip' column, which will be predicted from the input (x)
y = dataset.iloc[: , dataset.columns == 'tip']

In [176]:
y.head()

Unnamed: 0,tip
0,1.01
1,1.66
2,3.5
3,3.31
4,3.61


In [177]:
# importing sklearn to split the data and train a Linear Regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [178]:
# 80% training, 20% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

In [179]:
# passing x_train and y_train into linear Regression to train model
regressor = LinearRegression()
regressor.fit(x_train, y_train)

In [180]:
# predicting 'tip' for the test inputs and, for comparison, for the training inputs
predicted_y_test = regressor.predict(x_test)
predicted_y_train = regressor.predict(x_train)
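The predictions on the test set are only useful once they are compared with the true tips. The sketch below shows one way to do that with `r2_score` and `mean_absolute_error` from `sklearn.metrics`. It runs on synthetic data generated for illustration (the coefficients 0.6, 0.09 and 0.2 are made up), so the numbers it prints say nothing about the real tips model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tips data (assumption: tip grows with bill and party size)
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "size": rng.integers(1, 7, n),
})
target = 0.6 + 0.09 * features["total_bill"] + 0.2 * features["size"] + rng.normal(0, 0.3, n)

# Same 80/20 split as in the notebook
x_tr, x_te, y_tr, y_te = train_test_split(features, target, test_size=0.2, random_state=0)

model = LinearRegression().fit(x_tr, y_tr)
y_hat = model.predict(x_te)

# Average absolute error in dollars, and share of variance explained
mae = mean_absolute_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)

print(f"MAE: {mae:.3f}")
print(f"R^2: {r2:.3f}")
```

With the notebook's own variables, the equivalent calls would be `mean_absolute_error(y_test, predicted_y_test)` and `r2_score(y_test, predicted_y_test)`.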

In [181]:
# coefficients (slopes) of the fitted model, one per input column in the order of x
print(f'Coefficient : {regressor.coef_}')

Coefficient : [[ 0.08532528 -0.0227075  -0.00911551  0.0279499   0.1775259   0.20739972]]


In [182]:
# y-intercept of the fitted model
print(f'Intercept : {regressor.intercept_}')

Intercept : [0.64599566]
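A linear regression prediction is the intercept plus the sum of each coefficient times its feature value. As a quick check of that reading, the sketch below copies the coefficient and intercept values printed above into NumPy and applies them by hand to the first row of `x` (16.99, 0, 0, 3, 0, 2). The variable names are illustrative.

```python
import numpy as np

# Values copied from the printed regressor.coef_ and regressor.intercept_ above,
# in column order: total_bill, sex, smoker, day, time, size
coef = np.array([0.08532528, -0.0227075, -0.00911551, 0.0279499, 0.1775259, 0.20739972])
intercept = 0.64599566

# First row of x as shown by x.head()
row0 = np.array([16.99, 0, 0, 3, 0, 2])

# intercept + sum(coefficient * feature value)
manual_prediction = float(intercept + coef @ row0)

print(round(manual_prediction, 4))  # 2.5943
```

The result agrees with the first value that `regressor.predict(x)` prints (2.59432127) up to the rounding of the printed coefficients.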


In [183]:
# displaying the trained model's predictions for every row of x
y_pred = regressor.predict(x)
print(y_pred)

[[2.59432127]
 [2.21160039]
 [3.12202111]
 [3.14243988]
 [3.65759282]
 [3.69461301]
 [1.87023999]
 [3.8302802 ]
 [2.40522948]
 [2.38304491]
 [1.99822791]
 [4.56801353]
 [2.43765309]
 [3.1092816 ]
 [2.41001867]
 [2.9632568 ]
 [2.23345464]
 [2.71928579]
 [2.80001448]
 [3.0633541 ]
 [2.62301638]
 [2.84794479]
 [2.46227453]
 [4.87230928]
 [2.7851344 ]
 [3.02843003]
 [2.23478636]
 [2.17676517]
 [2.94554593]
 [2.79333661]
 [1.9088438 ]
 [3.07450568]
 [2.40169358]
 [3.29687433]
 [2.61107084]
 [3.3543133 ]
 [2.6930424 ]
 [2.76865157]
 [2.89611656]
 [3.96950855]
 [2.67000457]
 [2.61171665]
 [2.31137167]
 [1.94788599]
 [4.13062518]
 [2.68253663]
 [3.01871823]
 [4.30127574]
 [3.7653737 ]
 [2.66120531]
 [2.19191629]
 [2.02264191]
 [4.52961716]
 [1.97007056]
 [3.71765084]
 [2.78492697]
 [4.74288513]
 [3.37013549]
 [2.04392801]
 [5.62743799]
 [2.81612177]
 [2.26321397]
 [2.02515645]
 [3.06027065]
 [2.80225875]
 [3.01471869]
 [2.52029572]
 [1.16212827]
 [2.82011777]
 [2.36560431]
 [2.11959724]
 [2.78

## Q.12:

Check your model's prediction.


In [184]:
x.head()

Unnamed: 0,total_bill,sex,smoker,day,time,size
0,16.99,0,0,3,0,2
1,10.34,1,0,3,0,3
2,21.01,1,0,3,0,3
3,23.68,1,0,3,0,2
4,24.59,0,0,3,0,4


In [185]:
# new observation to predict a tip for, in the column order of x: total_bill, sex, smoker, day, time, size
x_new = [24.50, 1, 0, 0, 1, 4]

In [186]:
# appending the new observation to x as a new row
x.loc[len(x)] = x_new



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
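The warning above appears because `x` was created as a selection from `dataset` and is then modified in place. One way to sidestep it, sketched below, is to leave `x` untouched and pass the new observation to `predict` as its own one-row DataFrame. To stay self-contained, the sketch fits a toy model on only the ten encoded rows visible in the `head()` and `tail()` outputs, so its prediction is not the notebook's result; the names `x_small`, `y_small` and `model` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

columns = ["total_bill", "sex", "smoker", "day", "time", "size"]

# The ten encoded rows visible in the head() and tail() outputs; far too few for a real model
x_small = pd.DataFrame([
    [16.99, 0, 0, 3, 0, 2],
    [10.34, 1, 0, 3, 0, 3],
    [21.01, 1, 0, 3, 0, 3],
    [23.68, 1, 0, 3, 0, 2],
    [24.59, 0, 0, 3, 0, 4],
    [29.03, 1, 0, 2, 0, 3],
    [27.18, 0, 1, 2, 0, 2],
    [22.67, 1, 1, 2, 0, 2],
    [17.82, 1, 0, 2, 0, 2],
    [18.78, 0, 0, 0, 0, 2],
], columns=columns)
y_small = pd.Series([1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00, 1.75, 3.00], name="tip")

model = LinearRegression().fit(x_small, y_small)

# Wrap the new observation in a one-row DataFrame with matching column names
x_new = pd.DataFrame([[24.50, 1, 0, 0, 1, 4]], columns=columns)
new_tip = model.predict(x_new)

print(new_tip)
```

With the notebook's trained model, the corresponding call would be `regressor.predict(pd.DataFrame([x_new], columns = x.columns))`, which returns only the prediction for the new row instead of all 245.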



In [187]:
x.tail()

Unnamed: 0,total_bill,sex,smoker,day,time,size
240,27.18,0.0,1.0,2.0,0.0,2.0
241,22.67,1.0,1.0,2.0,0.0,2.0
242,17.82,1.0,0.0,2.0,0.0,2.0
243,18.78,0.0,0.0,0.0,0.0,2.0
244,24.5,1.0,0.0,0.0,1.0,4.0


In [188]:
# predicting the output
y_pred = regressor.predict(x)

In [189]:
# printing the predictions; the last row is the prediction for the new observation
print(y_pred)

[[2.59432127]
 [2.21160039]
 [3.12202111]
 [3.14243988]
 [3.65759282]
 [3.69461301]
 [1.87023999]
 [3.8302802 ]
 [2.40522948]
 [2.38304491]
 [1.99822791]
 [4.56801353]
 [2.43765309]
 [3.1092816 ]
 [2.41001867]
 [2.9632568 ]
 [2.23345464]
 [2.71928579]
 [2.80001448]
 [3.0633541 ]
 [2.62301638]
 [2.84794479]
 [2.46227453]
 [4.87230928]
 [2.7851344 ]
 [3.02843003]
 [2.23478636]
 [2.17676517]
 [2.94554593]
 [2.79333661]
 [1.9088438 ]
 [3.07450568]
 [2.40169358]
 [3.29687433]
 [2.61107084]
 [3.3543133 ]
 [2.6930424 ]
 [2.76865157]
 [2.89611656]
 [3.96950855]
 [2.67000457]
 [2.61171665]
 [2.31137167]
 [1.94788599]
 [4.13062518]
 [2.68253663]
 [3.01871823]
 [4.30127574]
 [3.7653737 ]
 [2.66120531]
 [2.19191629]
 [2.02264191]
 [4.52961716]
 [1.97007056]
 [3.71765084]
 [2.78492697]
 [4.74288513]
 [3.37013549]
 [2.04392801]
 [5.62743799]
 [2.81612177]
 [2.26321397]
 [2.02515645]
 [3.06027065]
 [2.80225875]
 [3.01471869]
 [2.52029572]
 [1.16212827]
 [2.82011777]
 [2.36560431]
 [2.11959724]
 [2.78