# **TASK 1: RESTAURANT TIPS**
**Pranay Srivastava\
ME379M: Data Science for Engineers\
DUE: 01/17/2023**


## Strategy
* Perform linear regression on a training subset of the data, filtered by the categories of each test example (the categories serve as filters)
    * Will work with the dataset as a Pandas dataframe
    * 80/20 train/test split on data using sklearn
    * Find useful filtering categories for linear regression based on trial and error
    * Try to get as accurate as possible given this less-refined approach
* Model effectiveness is measured by the average percent error of predictions over the test set.
    * Values are not great (the most successful pass still had an average error of around 18%), but it's likely that better accuracy will come with a more refined method
* *NOTE: I am not used to using Jupyter notebooks yet. I just wanted to try it out this week as Dr. Iyoob mentioned that we will be submitting with this in the future. I coded everything in Python. I am also not familiar with many data science techniques outside of linear regression and a little bit about neural networks, so this was the best solution I could come up with (as I'm assuming machine learning would be overkill on a dataset this size).*

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from scipy import stats

## Import File
I am importing `tips.csv` from my current working directory using pandas' `read_csv()` function.

In [None]:
# os.path.join builds the path with the correct separator on any operating system
tips = pd.read_csv(os.path.join(os.getcwd(), 'tips.csv'))

## Add Tip Percentage into Dataframe
Rather than using tip values directly, the regression is based on tip percentage (tip divided by total bill) as a function of the total bill, which gives a common basis for comparison between tips of different sizes. People tend to tip differently depending on how large their bill is, and the trend is more consistent for percentage values than for tip size (especially at higher bill sizes), even if the values themselves are not as relatively close.

In [None]:
# tip as a fraction of the total bill, computed column-wise
tips['percent_tip'] = tips['tip'] / tips['total_bill']

# plots for visualization purposes
# (the train/test split has not been made yet at this point, so the full dataframe is plotted)
plt.figure()
plt.scatter(tips['total_bill'], tips['tip'])
plt.title("Tip Value vs. Total Bill")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")

plt.figure()
# percent_tip is stored as a fraction, so scale by 100 to match the axis label
plt.scatter(tips['total_bill'], tips['percent_tip'] * 100)
plt.title("Percent Tip vs. Total Bill")
plt.xlabel("Total Bill ($)")
plt.ylabel("Percent Tip (%)")
plt.show()
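
As a small illustration of the tip percentage idea above, the sketch below computes the same ratio on a toy dataframe. The values and the name `toy` are made up for illustration and are not rows from `tips.csv`.

```python
import pandas as pd

# Hypothetical toy values, NOT rows from tips.csv
toy = pd.DataFrame({"total_bill": [10.0, 20.0, 50.0], "tip": [2.0, 3.0, 5.0]})

# Dividing one column by another works element-wise
toy["percent_tip"] = toy["tip"] / toy["total_bill"]
print(toy)
```

In this toy example the dollar tip grows with the bill while the tip fraction shrinks, which is the kind of trend the percentage view is meant to expose.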

## Filter Outliers
Any rows whose tip percentage is more than 2.5 standard deviations from the mean are treated as outliers and removed from the dataset.

In [None]:
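# absolute z-score of each tip percentage
z = np.abs(stats.zscore(tips['percent_tip']))
# keep rows within 2.5 standard deviations; a boolean mask does not depend on the index labels
tips = tips[z <= 2.5]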
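
A minimal sketch of the z-score filter on made-up numbers, assuming a boolean mask is used to select the rows to keep. The dataframe `toy` and its values are hypothetical, and the index is deliberately non-default to show that a mask works regardless of the row labels.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy values with one obvious outlier (0.90), NOT from tips.csv
values = [0.15] * 9 + [0.16] * 9 + [0.90]
toy = pd.DataFrame({"percent_tip": values}, index=range(100, 100 + len(values)))

# Absolute z-score of every value, computed on the raw NumPy array
z = np.abs(stats.zscore(toy["percent_tip"].to_numpy()))

# Keep only the rows within 2.5 standard deviations of the mean
filtered = toy[z <= 2.5]
print(len(toy), "->", len(filtered))
```

Note that a single extreme value also inflates the standard deviation, so on very small samples a fixed threshold may not flag anything.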

## Perform 80/20 Train/Test Split
The train dataframe is the data the regression is fit on. The test dataframe holds the points that I am trying to predict.

In [None]:
train, test = train_test_split(tips, test_size=0.2)
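
Because the split is random, the error values change from run to run. One option, sketched below on a hypothetical toy dataframe, is to pass `random_state` to `train_test_split` so that the same rows land in the test set every time. The seed value `0` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, NOT from tips.csv
toy = pd.DataFrame({"total_bill": range(10, 20), "tip": range(1, 11)})

# Same seed -> same shuffle -> same train/test rows
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=0)

print(len(train_a), len(test_a))
print(list(test_a.index) == list(test_b.index))
```

A fixed seed makes it easier to compare filter combinations fairly, since every trial is scored on the same test points.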

## Testing
I manually changed the categories used as filters for each input test point in order to see which combinations would be the most accurate. Through trial and error, I found that the most effective combination of filters came from pairing gender with time of day. The loop below goes through the entire test set and yields an average percent error value that served as the basis of my evaluation of the model. I also print the percentage of predictions that are within 5% of the actual tip value as a measure of how many predictions were "incredibly" accurate relative to the test data as a whole.

In [None]:
total_error = 0
acc_pred = 0

for i in test.index:
    
    # total_bill and tip_real are going to be used to calculate the predicted tip and measure accuracy of the model
    total_bill = test['total_bill'][i]
    tip_real = test['tip'][i]
    
    # placing attributes to be used as filters in a separate dictionary
    attributes = {
        'sex': test['sex'][i], 
        'smoker': test['smoker'][i], 
        'day': test['day'][i], 
        'time': test['time'][i], 
        'size': test['size'][i]
    }
    
    # applying filters based on attribute dictionary and creating subset of train data to perform regression on
    # (edit the list below to try other filter combinations; a filter that would leave no rows is skipped)
    subset = train
    for key in ['sex', 'time']:
        subset_temp = subset[subset[key] == attributes[key]]
        if not subset_temp.empty:
            subset = subset_temp
            
    # performing linear regression (degree-1 fit) using np.polyfit() and np.polyval()
    x, y = subset['total_bill'], subset['percent_tip']
    coeffs = np.polyfit(x,y,1)
    percentage_pred = np.polyval(coeffs, total_bill)
    
    # calculating predicted tip from predicted percentage
    tip_pred = percentage_pred*total_bill
    
    # collecting data on error for specific pass
    percent_error = abs((tip_real-tip_pred)/tip_real)
    print(i, '| Percent Error:', percent_error, '| Real Tip:', tip_real, '| Predicted Tip:', tip_pred)
    
    # adding data on error for full test set
    total_error += percent_error
    if percent_error < 0.05:
        acc_pred += 1
    
# print average error over entire test set
print('\nAverage Error:', total_error/len(test))

# print percentage of "incredibly" accurate predictions (within 5%)
print('Accurate Prediction Rate:', acc_pred/len(test))
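
The two summary numbers printed above can also be computed in one step with NumPy arrays. The sketch below assumes the real and predicted tips have been collected into two sequences. The function name `summarize_errors` and the sample values are hypothetical.

```python
import numpy as np

def summarize_errors(tip_real, tip_pred, tolerance=0.05):
    """Return (average percent error, share of predictions within `tolerance`)."""
    tip_real = np.asarray(tip_real, dtype=float)
    tip_pred = np.asarray(tip_pred, dtype=float)
    # Absolute error of each prediction relative to the real tip
    percent_error = np.abs((tip_real - tip_pred) / tip_real)
    # Mean of a boolean array is the fraction of True entries
    return percent_error.mean(), (percent_error < tolerance).mean()

# Hypothetical values for illustration only
avg_error, accurate_rate = summarize_errors([2.0, 4.0, 5.0], [2.0, 3.0, 5.5])
print('Average Error:', avg_error)
print('Accurate Prediction Rate:', accurate_rate)
```

Both values are fractions, matching the loop above, so multiply by 100 to report them as percentages.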
