## Lab Session 

### Learning Objective:
- Working with data using Python libraries.
- Data Visualization.
- Exploratory data analysis and data preprocessing.
- Building a Linear regression model to predict the tip amount based on different input features.

### About the dataset (Customer Tip Data)

#### Dataset Source: https://www.kaggle.com/datasets/ranjeetjain3/seaborn-tips-dataset

The dataset contains information about 244 orders served at a restaurant in the United States. Each observation records factors related to the order, such as the total bill, the time of the order, the number of people in the group, and the gender of the person paying.

#### Attribute Information:

- **total_bill:** Total bill (cost of the meal), including tax, in US dollars
- **tip:** Tip in US dollars
- **sex:** Sex of person paying for the meal
- **smoker:** Whether there is a smoker in the group
- **day:** Day on which the order is served
- **time:** Time of the order
- **size:** Size of the group

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

### Import required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score

### Load the dataset

In [None]:
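url = "https://www.kaggle.com/datasets/ranjeetjain3/seaborn-tips-dataset"  # dataset page; download tips.csv from here
tips_df = pd.read_csv('tips.csv')  # assumes tips.csv is saved next to this notebook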

### 1. Make a list of categorical and numerical columns in the data.

In [None]:
# Treat every non-numeric column as categorical (works for object, string and category dtypes)
categorical_columns = tips_df.select_dtypes(exclude=np.number).columns.tolist()
numerical_columns = tips_df.select_dtypes(include=np.number).columns.tolist()

### 2. Compute the average bill amount for each day.

In [None]:
average_bill_per_day = tips_df.groupby('day')['total_bill'].mean()

### 3. Which gender is more generous in giving tips?

In [None]:
average_tip_by_gender = tips_df.groupby('sex')['tip'].mean()
more_generous_gender = average_tip_by_gender.idxmax()

### 4. According to the data, were there more customers for dinner or lunch?

In [None]:
customer_count_by_time = tips_df['time'].value_counts()
more_customers_time = customer_count_by_time.idxmax()

### 5. Based on the statistical summary, comment on the variable 'tip'

In [None]:
tip_summary = tips_df['tip'].describe()


### 6. Find the busiest day in terms of the number of orders.

In [None]:
busiest_day = tips_df['day'].value_counts().idxmax()

### 7. Is the variable 'total_bill' skewed? If yes, identify the type of skewness. Support your answer with a plot

In [None]:
sns.histplot(tips_df['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
plt.show()
skewness_total_bill = tips_df['total_bill'].skew()
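
To read the skewness value computed above, a common convention is that a positive value indicates a longer right tail and a negative value a longer left tail. The sketch below illustrates this on a small made-up series; the numbers are hypothetical and are not taken from the tips data.

```python
import pandas as pd

# Hypothetical bill amounts: most are small, a few are large (long right tail).
bills = pd.Series([10.0, 12.5, 13.0, 15.0, 16.5, 18.0, 20.0, 24.0, 35.0, 50.0])

skewness = bills.skew()  # sample skewness; 0 would mean a symmetric sample

if skewness > 0:
    skew_type = "right (positive) skew"
elif skewness < 0:
    skew_type = "left (negative) skew"
else:
    skew_type = "symmetric"

# In a right-skewed sample the mean is usually pulled above the median.
print(f"skewness = {skewness:.2f} -> {skew_type}")
print(f"mean = {bills.mean():.2f}, median = {bills.median():.2f}")
```

The same check can be applied to `skewness_total_bill` once the histogram has been inspected.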


### 8. Is the tip amount dependent on the total bill? Visualize the relationship with an appropriate plot and metric, and write your findings.

In [None]:
sns.scatterplot(x='total_bill', y='tip', data=tips_df)
plt.title('Relationship between Total Bill and Tip')
plt.show()
correlation_total_bill_tip = tips_df['total_bill'].corr(tips_df['tip'])
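
The metric returned by `Series.corr` is, by default, the Pearson correlation coefficient. As a rough illustration of what it measures, the sketch below recomputes it by hand on a toy frame (hypothetical values, not the real data) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [1.5, 3.0, 4.0, 6.5, 7.0],
})

# Pearson correlation as computed by pandas (the default method).
r_pandas = toy["total_bill"].corr(toy["tip"])

# The same quantity from its definition: covariance over the product of spreads.
x = toy["total_bill"] - toy["total_bill"].mean()
y = toy["tip"] - toy["tip"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(f"pandas: {r_pandas:.4f}, manual: {r_manual:.4f}")
```

A value close to +1 suggests a strong positive linear relationship; it does not by itself establish that one variable causes the other.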


### 9. What is the percentage of males and females in the dataset? Display it in a plot.

In [None]:
gender_percentage = tips_df['sex'].value_counts(normalize=True) * 100
sns.countplot(x='sex', data=tips_df)
plt.title('Gender Distribution')
plt.show()

### 10. Compute the gender-wise count based on smoking habits and display it in the plot

In [None]:
gender_smoker_count = tips_df.groupby(['sex', 'smoker']).size().unstack()
gender_smoker_count.plot(kind='bar', stacked=True)
plt.title('Gender-wise Count based on Smoking Habits')
plt.show()

### 11. Compute the average tip amount given for different days and display it in the plot.

In [None]:
average_tip_by_day = tips_df.groupby('day')['tip'].mean()
average_tip_by_day.plot(kind='bar')
plt.title('Average Tip Amount for Different Days')
plt.show()


### 12. Is the average bill amount dependent on the size of the group? Visualize the relationship using an appropriate plot and write your findings.

In [None]:
sns.scatterplot(x='size', y='total_bill', data=tips_df)
plt.title('Relationship between Size and Total Bill')
plt.show()
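
Since the question asks about the average bill, it may also help to aggregate before plotting. The following is a minimal sketch on a toy frame (hypothetical values); with the real data the same `groupby` could be applied to `tips_df`.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "size": [2, 2, 3, 4, 4],
    "total_bill": [15.0, 25.0, 30.0, 40.0, 50.0],
})

# Mean bill for each group size; the result is indexed by 'size'.
avg_bill_by_size = toy.groupby("size")["total_bill"].mean()
print(avg_bill_by_size)

# True if the average bill never decreases as the group size grows.
print(avg_bill_by_size.is_monotonic_increasing)
```

Plotting `avg_bill_by_size` as a bar chart is one way to make the trend easier to read than a raw scatter plot.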


### 13. Plot a horizontal boxplot to compare the bill amount based on gender

In [None]:
sns.boxplot(x='total_bill', y='sex', data=tips_df, orient='h')
plt.title('Comparison of Bill Amount based on Gender')
plt.show()


### 14. Find the maximum bill amount for lunch and dinner on Saturday and Sunday

In [None]:
# The 'day' column in tips.csv uses abbreviated labels ('Thur', 'Fri', 'Sat', 'Sun')
max_bill_lunch_saturday_sunday = tips_df.loc[(tips_df['day'].isin(['Sat', 'Sun'])) & (tips_df['time'] == 'Lunch'), 'total_bill'].max()
max_bill_dinner_saturday_sunday = tips_df.loc[(tips_df['day'].isin(['Sat', 'Sun'])) & (tips_df['time'] == 'Dinner'), 'total_bill'].max()
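
An alternative to two separate filters is a single `groupby` over day and time. The sketch below uses a toy frame with hypothetical values and assumes the day column uses abbreviated labels such as 'Sat' and 'Sun'; combinations with no rows simply do not appear in the result.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Fri", "Thur"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
    "total_bill": [20.0, 50.0, 48.0, 15.0, 12.0, 43.0],
})

# Keep weekend rows, then take the maximum bill per (day, time) pair.
weekend = toy[toy["day"].isin(["Sat", "Sun"])]
max_bill = weekend.groupby(["day", "time"])["total_bill"].max()
print(max_bill)
```

Note that calling `.max()` on an empty selection returns `NaN`, so a `NaN` result from the filter-based approach usually means no rows matched the condition.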


### 15. Compute the percentage of missing values in the dataset.

In [None]:
missing_percentage = tips_df.isnull().mean() * 100

### 16. Are there any duplicate records in the dataset? If yes, compute the count of the duplicate records and drop them.

In [None]:
duplicate_count = tips_df.duplicated().sum()
tips_df = tips_df.drop_duplicates()
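
As a small illustration of how these two calls behave, the sketch below uses a toy frame with one repeated row (hypothetical values). By default, `duplicated()` flags every repeat after the first occurrence, and `drop_duplicates()` keeps that first occurrence.

```python
import pandas as pd

# Toy data with one exact duplicate row, for illustration only.
toy = pd.DataFrame({
    "total_bill": [13.0, 13.0, 20.0],
    "tip": [2.0, 2.0, 3.0],
})

n_dupes = int(toy.duplicated().sum())                    # rows flagged as repeats
deduped = toy.drop_duplicates().reset_index(drop=True)  # first occurrence is kept

print(f"duplicates: {n_dupes}, rows before: {len(toy)}, rows after: {len(deduped)}")
```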

### 17. Are there any outliers present in the column 'total_bill'? If yes, treat them with a transformation approach, and plot a boxplot before and after the treatment.

In [None]:
sns.boxplot(x='total_bill', data=tips_df)
plt.title('Boxplot of Total Bill before Outlier Treatment')
plt.show()
# Apply transformation, e.g., log transformation
tips_df['total_bill'] = np.log1p(tips_df['total_bill'])
sns.boxplot(x='total_bill', data=tips_df)
plt.title('Boxplot of Total Bill after Outlier Treatment')
plt.show()
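
The log transformation compresses large values more than small ones, which tends to pull in a long right tail. One rough way to check the effect numerically, alongside the boxplots, is to compare the skewness before and after. The sketch below does this on a made-up series (hypothetical values); it is not guaranteed that every outlier disappears after the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical bill amounts with one large value, for illustration only.
bills = pd.Series([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 36.0])

skew_before = bills.skew()
log_bills = np.log1p(bills)        # log(1 + x), defined for x >= 0
skew_after = log_bills.skew()

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```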

### 18. Are there any outliers present in the column 'tip'? If yes, remove them using the IQR technique.

In [None]:
Q1 = tips_df['tip'].quantile(0.25)
Q3 = tips_df['tip'].quantile(0.75)
IQR = Q3 - Q1
tips_df = tips_df[(tips_df['tip'] >= Q1 - 1.5 * IQR) & (tips_df['tip'] <= Q3 + 1.5 * IQR)]
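
The filter above keeps values inside the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The sketch below wraps the same idea in a helper and applies it to a toy column; the function name `remove_iqr_outliers` and the sample values are hypothetical and used only for illustration.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Return rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR], plus the fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = df[column].between(lower, upper)  # bounds are inclusive
    return df[mask].copy(), (lower, upper)

# Toy tips with one unusually large value.
toy = pd.DataFrame({"tip": [1.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 10.0]})
cleaned, (lower, upper) = remove_iqr_outliers(toy, "tip")

print(f"fences: [{lower}, {upper}], rows kept: {len(cleaned)} of {len(toy)}")
```

Returning the fences along with the filtered frame makes it easier to report which threshold was applied.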


### 19. Encode the categorical columns in the dataset and print 5 random samples from the dataframe.

In [None]:
encoded_tips_df = pd.get_dummies(tips_df, columns=categorical_columns, drop_first=True)
random_samples = encoded_tips_df.sample(5)


### 20. Check the range of the column 'total_bill' and transform the values such that the range will be 1.

In [None]:
min_total_bill = encoded_tips_df['total_bill'].min()
max_total_bill = encoded_tips_df['total_bill'].max()
encoded_tips_df['total_bill'] = (encoded_tips_df['total_bill'] - min_total_bill) / (max_total_bill - min_total_bill)
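
The rescaling in the cell above is the usual min-max transformation. Written out, with $x_{\min}$ and $x_{\max}$ denoting the smallest and largest values in the column:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

The minimum is mapped to 0 and the maximum to 1, so the range of the transformed column is $1 - 0 = 1$.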


### 21. Load the dataset again by giving the name of the dataframe as "tips_df"
- i) Encode the categorical variables.
- ii) Store the target column (i.e. tip) in the y variable and the rest of the columns in the X variable

In [None]:
tips_df = pd.read_csv('tips.csv')
X = pd.get_dummies(tips_df.drop('tip', axis=1), drop_first=True)
y = tips_df['tip']
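
To see what `pd.get_dummies(..., drop_first=True)` produces, the sketch below encodes a toy frame (hypothetical values). For each categorical column, the first category in sorted order is dropped and the remaining ones become indicator columns.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0],
    "sex": ["Female", "Male", "Female"],
    "time": ["Dinner", "Lunch", "Dinner"],
})

# Numeric columns pass through; each text column becomes k-1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per column avoids perfectly redundant indicator columns, which is a common choice for linear regression.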


### 22. Split the dataset into two parts (i.e. 70% train and 30% test), and scale the columns "total_bill" and "size" using the min-max scaling approach

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = MinMaxScaler()
X_train[['total_bill', 'size']] = scaler.fit_transform(X_train[['total_bill', 'size']])
X_test[['total_bill', 'size']] = scaler.transform(X_test[['total_bill', 'size']])
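
The scaler is fitted on the training split only and then reused on the test split, so that no information from the test data leaks into preprocessing. One consequence, sketched below on toy arrays (hypothetical values), is that scaled test values may fall outside the 0 to 1 interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature splits, for illustration only.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [40.0]])

scaler = MinMaxScaler().fit(train)        # learns min=10 and max=30 from train only
train_scaled = scaler.transform(train)    # spans exactly 0 to 1
test_scaled = scaler.transform(test)      # uses the train min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```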


### 23. Train a linear regression model using the training data and print the r_squared value of the prediction on the test data.

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r_squared_value = r2_score(y_test, y_pred)
print(f"R-squared value: {r_squared_value}")

# Displaying the plots and results.
plt.show()
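
For reference, the R-squared value reported by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand on toy arrays (hypothetical values, not model output) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"manual: {r2_manual:.3f}, sklearn: {r2_score(y_true, y_hat):.3f}")
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible on test data.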

### Happy Learning:)