## Lab Session 

### Learning Objective:
- Working with data using Python libraries.
- Data Visualization.
- Exploratory data analysis and data preprocessing.
- Building a Linear regression model to predict the tip amount based on different input features.

### About the dataset (Customer Tip Data)

#### Dataset Source: https://www.kaggle.com/datasets/ranjeetjain3/seaborn-tips-dataset

The dataset contains information about the 244 orders served at a restaurant in the United States. Each observation includes the factors related to the order like total bill, time, the total number of people in a group, gender of the person paying for the order and so on.

#### Attribute Information:

- **total_bill:** Total bill (cost of the meal), including tax, in US dollars
- **tip:** Tip in US dollars
- **sex:** Sex of person paying for the meal
- **smoker:** Whether there is a smoker in the group
- **day:** Day on which the order is served
- **time:** Time of the order
- **size:** Size of the group

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

### Import required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

import sklearn
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

### Load the dataset

In [None]:
df = pd.read_csv('tips.csv')
df.head()

### 1. Make a list of categorical and numerical columns in the data.

In [None]:
cat = []
num = []
for i in df.columns:
    if df[i].dtypes == 'object':
        cat.append(i)
    else:
        num.append(i)

print('The categorical variables are:\n', cat,'\n')
print('The numerical variables are:\n', num)

### 2. Compute the average bill amount for each day.

In [None]:
df.groupby('day')['total_bill'].mean()

### 3. Which gender is more generous in giving tips?

In [None]:
df.groupby('sex')['tip'].mean()

### 4. According to the data, were there more customers for dinner or lunch?

In [None]:
df.groupby('time')['size'].sum()  # sum of party sizes = number of customers per meal time

### 5. Based on the statistical summary, comment on the variable 'tip'

In [None]:
df['tip'].describe()

### 6. Find the busiest day in terms of the number of orders.

In [None]:
df.day.mode()

### 7. Is the variable 'total_bill' skewed? If yes, identify the type of skewness. Support your answer with a plot

In [None]:
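sns.histplot(df['total_bill'], kde=True)  # distplot is deprecated in recent seaborn releases
print('Skewness:', df['total_bill'].skew())  # a positive value indicates right (positive) skew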
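
To make "type of skewness" concrete, the sketch below computes the moment-based skewness coefficient in plain Python. The sample values are made up for illustration and are not taken from `tips.csv`. Note that `Series.skew()` in pandas applies a bias correction, so its value will differ slightly from this simplified formula.

```python
def skewness(values):
    """Moment-based skewness: third central moment divided by std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples, for illustration only
right_tailed = [10.0, 12.0, 13.0, 15.0, 18.0, 45.0]  # one large value stretches the right tail
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

print('right-tailed:', skewness(right_tailed))
print('symmetric:', skewness(symmetric))
```

A positive coefficient means a longer right tail (right or positive skew); a negative one means a longer left tail.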

### 8. Is the tip amount dependent on the total bill? Visualize the relationship with an appropriate plot and metric, and write your findings.

In [None]:
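sns.scatterplot(x='total_bill', y='tip', data=df)  # plot of the relationship
df[['tip','total_bill']].corr()  # Pearson correlation as the metric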
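
For reference, `DataFrame.corr()` defaults to the Pearson correlation coefficient. The following plain-Python sketch shows how that coefficient is computed; the bill and tip values are hypothetical and chosen only to illustrate the formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (bill, tip) values, for illustration only
bills = [10.0, 20.0, 30.0, 40.0]
tips = [1.5, 3.0, 4.0, 6.5]

r = pearson_r(bills, tips)
print(round(r, 3))
```

Values close to 1 indicate a strong positive linear relationship, values close to -1 a strong negative one, and values near 0 little linear relationship.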

### 9. What is the percentage of males and females in the dataset? Display it in a plot.

In [None]:
df['sex'].value_counts(normalize=True).plot(kind='pie',autopct='%.2f%%')

### 10. Compute the gender-wise count based on smoking habits and display it in a plot.

In [None]:
df.groupby('sex')['smoker'].value_counts()

In [None]:
pd.crosstab(df['smoker'],df['sex']).plot(kind='bar')

### 11. Compute the average tip amount given for different days and display it in a plot.

In [None]:
df.groupby('day')['tip'].mean()

In [None]:
sns.barplot(x='day',y='tip',data=df)
plt.show()

### 12. Is the average bill amount dependent on the size of the group? Visualize the relationship using an appropriate plot and write your findings.

In [None]:
df.groupby('size')['total_bill'].mean().plot(kind='bar')
plt.show()

### 13. Plot a horizontal boxplot to compare the bill amount based on gender

In [None]:
sns.boxplot(x='total_bill',y='sex',data=df)
plt.show()

### 14. Find the maximum bill amount for lunch and dinner on Saturday and Sunday

In [None]:
df.groupby(['time','day'])['total_bill'].max()

### 15. Compute the percentage of missing values in the dataset.

In [None]:
df.isnull().sum()/len(df)*100

### 16. Are there any duplicate records in the dataset? If yes, compute the count of the duplicate records and drop them.

In [None]:
len(df[df.duplicated()])

In [None]:
## Dropping duplicates.
df.drop_duplicates(inplace=True)

In [None]:
## Recheck
len(df[df.duplicated()])

### 17. Are there any outliers present in the column 'total_bill'? If yes, treat them with a transformation approach, and plot a boxplot before and after the treatment.

In [None]:
## boxplot before treatment
sns.boxplot(x=df['total_bill'])
plt.show()

In [None]:
## Treating outliers using the log transformation
df['total_bill_trans'] = np.log(df['total_bill'])

## boxplot after transformation
sns.boxplot(x=df['total_bill_trans'])
plt.show()

### 18. Are there any outliers present in the column 'tip'? If yes, remove them using the IQR technique.

In [None]:
sns.boxplot(x=df['tip'])
plt.show()

In [None]:
## Using IQR method
Q1 = df['tip'].quantile(0.25)
Q3 = df['tip'].quantile(0.75)
IQR = Q3-Q1

lower_whisker = Q1-(1.5*IQR)
upper_whisker = Q3+(1.5*IQR)

In [None]:
df_out = df.loc[(df['tip'] < upper_whisker) & (df['tip'] > lower_whisker)] # rows without outliers

In [None]:
sns.boxplot(x=df_out['tip'])
plt.show()
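
The IQR rule used above can be checked by hand on a small sample. This plain-Python sketch uses `statistics.quantiles` with `method='inclusive'`, which should correspond to the linear interpolation that `Series.quantile` uses by default. The tip values and the helper name `iqr_fences` are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method='inclusive' interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical tip amounts; 10.0 is a deliberately extreme value
sample_tips = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0]

lower, upper = iqr_fences(sample_tips)
kept = [t for t in sample_tips if lower < t < upper]  # same strict comparison as the cell above
print(lower, upper)
print(kept)
```

Any value outside the two fences is treated as an outlier and dropped; everything strictly between them is kept.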

### 19. Encode the categorical columns in the dataset and print 5 random samples from the dataframe.

In [None]:
df = pd.get_dummies(df, drop_first=True)
df.sample(5)

### 20. Check the range of the column 'total_bill' and transform the values such that the range will be 1.

In [None]:
tb_max = df['total_bill'].max()
tb_min = df['total_bill'].min()
range_ = tb_max-tb_min
print(range_)

In [None]:
## Initialize minmaxscaler
mm = MinMaxScaler()

In [None]:
## Normalizing the values of the total_bill, so that the range will be 1
df['total_bill_mm'] = mm.fit_transform(df[['total_bill']])

In [None]:
## Checking the range after normalization
tb_mm_max = df['total_bill_mm'].max()
tb_mm_min = df['total_bill_mm'].min()
range_ = tb_mm_max-tb_mm_min
print(range_)
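
Min-max scaling maps each value `v` to `(v - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The plain-Python sketch below illustrates that formula on hypothetical bill amounts; the helper `min_max_scale` is an illustration, not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: return zeros to avoid dividing by zero (a choice made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# Hypothetical bill amounts, for illustration only
bills = [10.0, 25.0, 40.0, 50.0]

scaled = min_max_scale(bills)
print(scaled)
print('range:', max(scaled) - min(scaled))
```

After scaling, the range (maximum minus minimum) is 1 regardless of the original units.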

### 21. Load the dataset again by giving the name of the dataframe as "tips_df"
- i) Encode the categorical variables.
- ii) Store the target column (i.e. tip) in the y variable and the rest of the columns in the x variable

In [None]:
## loading the dataset again as 'tips_df'
tips_df = pd.read_csv('tips.csv')
tips_df.head(2)

In [None]:
## Encoding categorical variables
tips_df = pd.get_dummies(tips_df,drop_first=True)
tips_df.head(2)

In [None]:
## Storing the target column in the y variable and the rest of the columns in the x variable
x = tips_df.drop('tip',axis=1)
y = tips_df['tip']

### 22. Split the dataset into two parts (i.e. 70% train and 30% test), and scale the columns "total_bill" and "size" using the min-max scaling approach

In [None]:
## Split the data
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.30)
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)

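## Scaling 'total_bill' and 'size' using min-max scaling
mm = MinMaxScaler()
scale_cols = ['total_bill', 'size']

## Fit on the training data only, then apply the same scaling to the test data
x_train[scale_cols] = mm.fit_transform(x_train[scale_cols])
x_test[scale_cols] = mm.transform(x_test[scale_cols])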
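
The scaler is fitted on the training data only and then reused on the test data, so no information from the test set leaks into preprocessing. A side effect is that scaled test values can fall outside the 0 to 1 interval. The plain-Python sketch below illustrates this with hypothetical numbers and helper names.

```python
def fit_min_max(train_values):
    """Learn the scaling parameters (min and max) from the training data only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Apply previously learned parameters to any data, train or test."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values, for illustration only
train = [10.0, 20.0, 30.0]
test = [5.0, 25.0, 40.0]

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

print(train_scaled)  # stays within [0, 1]
print(test_scaled)   # may go below 0 or above 1
```

Test values smaller than the training minimum map below 0, and values larger than the training maximum map above 1.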

### 23. Train a linear regression model using the training data and print the r_squared value of the prediction on the test data.

In [None]:
## Fitting a linear regression model on the train data
lr = LinearRegression()
lr.fit(x_train,y_train)

In [None]:
## Making prediction on the test data
pred = lr.predict(x_test)

In [None]:
## Computing r2_score
print('r2-score test:', r2_score(y_test,pred))
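
As a reminder of what `r2_score` measures: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The plain-Python sketch below works through that definition on hypothetical tip values; the helper `r_squared` is an illustration only.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation around the mean
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted tips, for illustration only
actual = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 2.5, 4.5, 4.5]

score = r_squared(actual, predicted)
print(score)
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that baseline.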

### Happy Learning:)