![How%20Much%20is%20that%20table%20going%20to%20tip%20you_.png](attachment:How%20Much%20is%20that%20table%20going%20to%20tip%20you_.png)

# Tips Prediction Algorithm
***
*The concept of a restaurant is fairly new, with the "first real restaurant" opening in the late 1700s. However, the practice of employing waitstaff has quickly brought about one of the most prevalent jobs in America. According to the Bureau of Labor Statistics, there were about 3.2 million servers/bartenders in the United States in 2019, and this number is forecasted to grow as new restaurants crop up every day.*

*Yet, even with this growth in opportunity, almost all establishments lack an accurate estimate of their employees' income due to the volatility of the gratuity system. As a result, many service applicants have no choice but to apply blindly to different restaurants, using nothing but the minimum wage as a point of reference for their potential income. This is a poor indicator of earnings, as the hourly wage often makes up a minority of a server's income. The assumption that higher-end restaurants garner greater tips is generally true, but this still leaves a sizable range from which job-seekers must infer their income. On the other end, managers are unable to properly advertise their job openings.*

*This also has repercussions on a server's day-to-day responsibilities. One of the most stressful parts of being a server/bartender is picking which tasks to prioritize during a rush. It is difficult to match the disappointment a server feels when they extend their highest level of customer service to a guest and end up with a paltry tip. In this project, I will attempt to create a predictive model that both restaurants and servers can use to anticipate how much a table will tip, and on a grander scale, a server's level of income.*

## 1. Data
***
Kaggle is an online community of data scientists and machine learning practitioners where users can find and publish datasets. The original Kaggle dataset, which contains tips collected by an individual server at his restaurant, is available at the link below:

* [Kaggle Dataset](https://www.kaggle.com/jsphyg/tipping)

## 2. Data Cleaning
***
[Data Wrangling Report](https://github.com/transaint/Springboard-Projects/blob/master/Springboard%20Projects/Predicting%20a%20Table's%20Tips/Data%20Wrangling.ipynb)

* **Problem:** The dataset did not contain a column for tip percentages per table, although it did have a column for tip amounts. For most service workers, the tip percentage is more important than the total tip amount, as a \\$2 tip on a \\$5 bill is much more worthwhile than a \\$2 tip on a \\$50 bill. **Solution:** Added a column for tip percentages called `perc`.
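
The step above can be sketched in a few lines of pandas. This is a minimal illustration, not the code from the wrangling notebook: the column names `total_bill` and `tip` are assumed, the rows are hypothetical stand-ins for the Kaggle data, and expressing `perc` on a 0-100 scale is also an assumption.

```python
import pandas as pd

# Hypothetical rows standing in for the Kaggle data.
# The column names `total_bill` and `tip` are assumptions.
tips = pd.DataFrame({
    "total_bill": [5.00, 50.00, 20.00],
    "tip": [2.00, 2.00, 3.00],
})

# Tip as a percentage of the bill (0-100 scale, assumed).
tips["perc"] = tips["tip"] / tips["total_bill"] * 100

print(tips)
```

With these made-up rows, the same tip amount on the smaller bill yields a far higher `perc` than on the larger bill, which is the distinction the new column is meant to capture.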

## 3. EDA
***

[EDA Report](https://github.com/transaint/Springboard-Projects/blob/master/Springboard%20Projects/Predicting%20a%20Table's%20Tips/Exploratory%20Data%20Anslysis.ipynb)

* As expected, there is a strong positive correlation between the total bill and the final tip amount. However, tip percentages seem to decrease as the total bill increases. Perhaps customers are inclined to tip a smaller percentage when the bill is high.

![pairplots.JPG](attachment:pairplots.JPG)

* Tables on certain days tipped significantly more than tables on other days. This was confirmed with a series of pairwise t-tests comparing each pair of days.

![boxplots.JPG](attachment:boxplots.JPG)
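
One way to run day-by-day comparisons like these is sketched below. It is an assumption-laden illustration rather than the code from the EDA notebook: the data is randomly generated, the column names `day` and `perc` and the helper `pairwise_day_ttests` are hypothetical, and it uses Welch's independent two-sample t-test from SciPy, which may differ from the exact test used in the notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Randomly generated stand-in data; not the real tips.
rng = np.random.default_rng(0)
days = ["Thur", "Fri", "Sat", "Sun"]
tips = pd.DataFrame({
    "day": np.repeat(days, 30),
    "perc": rng.normal(loc=16, scale=5, size=120),
})


def pairwise_day_ttests(df, value_col="perc", group_col="day"):
    """Run Welch's t-test on every pair of groups (hypothetical helper)."""
    groups = {name: grp[value_col].to_numpy() for name, grp in df.groupby(group_col)}
    results = {}
    for a, b in combinations(sorted(groups), 2):
        # equal_var=False selects Welch's test (no equal-variance assumption).
        t_stat, p_value = stats.ttest_ind(groups[a], groups[b], equal_var=False)
        results[(a, b)] = (t_stat, p_value)
    return results


results = pairwise_day_ttests(tips)
for (a, b), (t_stat, p_value) in results.items():
    print(f"{a} vs {b}: t = {t_stat:.2f}, p = {p_value:.3f}")
```

Because several pairs are tested at once, a multiple-comparison correction (for example, dividing the significance threshold by the number of pairs) may be worth considering when reading the p-values.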

## 4. Algorithms and Machine Learning
***
[ML Notebook](https://github.com/transaint/Springboard-Projects/blob/master/Springboard%20Projects/Predicting%20a%20Table's%20Tips/Modeling.ipynb)

I chose Python's [scikit-learn](https://scikit-learn.org/stable/) package to train my predictive model. Using a bootstrapping method to resample 10,000 samples, I compared a linear regression model against a Random Forest regression model for predicting both tips and tip percentages. The Random Forest was far superior for both targets.

| Model | Mean Squared Error |
| :- | :- |
| Linear Regression - tips | 0.72 |
| Random Forest - tips | almost zero |
| Linear Regression - tip percentage | 3.74 |
| Random Forest - tip percentage | almost zero |

**WINNER: Random Forest algorithm**
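
A comparison of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the procedure from the ML notebook: the data is synthetic, the helper `bootstrap_mse` is hypothetical, each round fits on a bootstrap sample and scores on the rows left out of that sample, and only 20 rounds are used to keep it quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumed features: bill total and party size).
data_rng = np.random.default_rng(42)
n_rows = 244
total_bill = data_rng.uniform(5, 50, size=n_rows)
party_size = data_rng.integers(1, 7, size=n_rows)
X = np.column_stack([total_bill, party_size])
y = 0.15 * total_bill + data_rng.normal(0, 1, size=n_rows)


def bootstrap_mse(model, X, y, n_rounds=20, seed=0):
    """Average out-of-bag MSE over bootstrap rounds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)          # sample rows with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # rows not drawn this round
        model.fit(X[idx], y[idx])
        scores.append(mean_squared_error(y[oob], model.predict(X[oob])))
    return float(np.mean(scores))


lin_mse = bootstrap_mse(LinearRegression(), X, y)
rf_mse = bootstrap_mse(RandomForestRegressor(n_estimators=10, random_state=0), X, y)
print(f"Linear regression MSE: {lin_mse:.3f}")
print(f"Random forest MSE:     {rf_mse:.3f}")
```

Scoring on the left-out rows is a design choice made here so that each model is evaluated on data it was not fitted to; the numbers it prints relate only to the synthetic data and say nothing about the table above.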

Using grid search with cross-validation, the best hyperparameters for this algorithm were 10 estimators with standard scaling.
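
A search over the number of estimators and the choice of scaling might look like the sketch below. The grid values, the 5-fold setting, the step names `scaler` and `forest`, and the synthetic data are all assumptions made for illustration; the actual grid in the ML notebook may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.uniform(5, 50, size=(120, 2))
y = 0.15 * X[:, 0] + rng.normal(0, 1, size=120)

# Pipeline with a swappable scaling step followed by the forest.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestRegressor(random_state=0)),
])

# "passthrough" disables the scaling step, so the search compares
# scaled against unscaled inputs. Grid values are assumed.
param_grid = {
    "scaler": [StandardScaler(), "passthrough"],
    "forest__n_estimators": [10, 50, 100],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, which keeps information from the validation fold out of the preprocessing step.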

## 5. Evaluations
***
The Random Forest model fit the data so closely that the correlation between predicted and actual values in the test set was 1.0, meaning its predictions matched the observed values almost exactly on the data given.

![residuals.JPG](attachment:residuals.JPG)

![residuals2.JPG](attachment:residuals2.JPG)
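
The quantities behind an evaluation like this, residuals and the correlation between predicted and actual values on a held-out test set, can be computed as in the sketch below. It runs on synthetic data with an assumed 75/25 split, so its output is unrelated to the results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
rng = np.random.default_rng(1)
X = rng.uniform(5, 50, size=(200, 2))
y = 0.15 * X[:, 0] + rng.normal(0, 1, size=200)

# Hold out a quarter of the rows for testing (split size is assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residuals: actual minus predicted, one per test row.
residuals = y_test - y_pred

# Pearson correlation between actual and predicted values.
r = np.corrcoef(y_test, y_pred)[0, 1]
print(f"Correlation between actual and predicted: {r:.3f}")
print(f"Mean absolute residual: {np.abs(residuals).mean():.3f}")
```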

## 6. Future Improvements
***
* Although my model predicts this dataset almost perfectly, its ability to generalize to other restaurants and servers is questionable. Still, even if this exact model is only applicable to the restaurant where the data was collected, it is encouraging that the information required for an accurate predictive model can be easily quantified and gathered. For example, many believe a server's tips depend on the quality of service they provide, but an accurate model was still possible without data on customer service quality, which would have been much more difficult to collect.

* This dataset contained only 244 samples. A more reliable and generalizable model could probably be built with substantially more samples.