<h1 style='text-align:center; background: lightgreen; font-size: 3vw'>Top 250 Restaurants: AutoEDA & Feature Engineering</h1>

In this notebook, I am going to perform exploratory data analysis using two AutoEDA libraries that have recently caught my attention:
1. `pandas-profiling`
2. `dataprep`

Then, I am going to perform some feature engineering to see if we are able to produce interesting features that may be used for modelling!

However, before we actually begin our EDA and feature engineering, it is important to understand the origin of the data and the context of the data.
If you want to do so, here are the links explaining the context and data collection process of the 3 datasets:

1. Future 50: https://www.restaurantbusinessonline.com/future-50-2020
2. Top 250: https://www.restaurantbusinessonline.com/top-500-2020
3. Independence 100: https://www.restaurantbusinessonline.com/top-100-independents-2020


<h2> Here is a quick overview of the context of each dataset: </h2>

1. Future 50: a measure of the fastest-growing restaurant concepts with annual sales between 20 million and 50 million dollars a year.
2. Top 250: a measure of the largest restaurant concepts by U.S. systemwide sales, based on results from the 2019 calendar year.
3. Independence 100: a measure of the highest-grossing independent restaurants. Only restaurant concepts with no more than five locations are
considered "independents" for the purpose of this list.

In [None]:
!pip install dataprep

In [None]:
import numpy as np 
import pandas as pd 
from pandas_profiling import ProfileReport
from dataprep.eda import plot,plot_correlation
from scipy.stats import pearsonr,spearmanr

In [None]:
top_250 = pd.read_csv('/kaggle/input/restaurant-business-rankings-2020/Top250.csv')

<h1>EDA Process</h1>

1. Use `pandas_profiling` to get a general overview of the data, and to look for features that may prove interesting for exploration

2. Focus on the key features identified in the first step to gather interesting insights using `dataprep.eda`


<h1 style='text-align:center; background: lightgreen; font-size: 2vw'>Top 250 - EDA</h1>

Before we actually begin our EDA process, it's important to get a "feel" for the dataset. Let's view the schema for the dataset to do so:

    Rank: Position in ranking
    Restaurant: Name of restaurant
    Content: Description, only for certain restaurants
    Sales: Sales in 2019 (in millions of dollars)
    YOY_Sales: Year on year sales increase in %
    Units: Number of premises in US
    YOY_Units: Year on year premises increase in %
    Headquarters: Place of the restaurant's headquarters
    Segment_Category: Menu type and / or industry segment





Now, let's get a general overview of the dataset using the `ProfileReport` class from `pandas_profiling`

In [None]:
ProfileReport(top_250)

Ok, so `pandas-profiling` has given us some really interesting insights about the data! There is a lot to take in, and I do recommend that you look over the report yourself, but here are the key points:

1. **We have 9 features, with 6 being categorical and 3 being numeric.**

   **Insight**: The categorical features may require further investigation, possibly using feature engineering tools to encode the values into numeric values that
   can be interpreted by the model (or just use CatBoost :] )
   

2. **We have missing values in the Content and Headquarters features**


3. **We have some strong linear(and non-linear) correlations**

   In the interactions section, there seems to be a high **linear** correlation between the Sales and Units features, according to the Pearson correlation coefficient.
   However, note that a low Pearson correlation coefficient does not mean that two features are unrelated. To see this:
>    change the filter from Pearson's r to Spearman's ρ

   If you have 2 numeric features that are not linearly correlated, and one (or both) of them is an ordinal feature (e.g. a ranking or a hierarchical class),
   then you can measure the strength and direction of the relationship between them using a rank correlation statistic.
   The most common is Spearman's rank correlation, which considers the ranks of the values for the two variables.
   
   Spearman’s correlation is equivalent to calculating the Pearson correlation coefficient on the ranked data. So ρ will always be a value between -1 and 1. The further away ρ is from zero, the stronger the relationship between the two variables. The sign of ρ corresponds to the direction of the relationship. If it is positive, then as one variable increases, the other tends to increase. If it is negative, then as one variable increases, the other tends to decrease.

 You might want to use Spearman’s correlation if your data have a non-linear relationship (like an exponential relationship) or you have one or more outliers. However, Spearman’s correlation is only appropriate if the relationship between your variables is monotonic, meaning that as one variable increases, the other tends to either increase or decrease (not both).
 
![](http://sites.utexas.edu/sos/files/2017/06/Monotonic_both.png) 

Source: [UT Austin](http://sites.utexas.edu/sos/guided/inferential/numeric/bivariate/rankcor/)
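
To make the difference concrete, here is a minimal sketch on synthetic data (not the restaurant dataset). The arrays `x` and `y` are made up for illustration: `y` grows exponentially with `x`, so the relationship is monotonic but far from linear.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data, assumed for illustration only: y always increases with x,
# but exponentially rather than along a straight line.
x = np.arange(1, 51, dtype=float)
y = np.exp(x / 5.0)

pearson_r, _ = pearsonr(x, y)
spearman_rho, _ = spearmanr(x, y)

# Pearson is pulled down by the curvature; Spearman only looks at the ordering.
print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Because the ordering of `y` matches the ordering of `x` exactly, Spearman's coefficient reaches its maximum, while Pearson's coefficient stays noticeably lower.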

**Why does Spearman's correlation coefficient work with our data?**

1. We have a non-linear relationship between Rank and Sales
2. We have one ordinal feature (Rank)
3. We have a monotonic relationship between the 2 features (as the Rank number increases, Sales decreases, since Rank 1 has the highest sales)

# Comparing correlations (do not always rely on the Pearson correlation coefficient!)

First, let's plot `Rank` against `Sales`

In [None]:
# Column names are passed positionally, matching the other plot() calls in this notebook
plot(top_250, 'Sales', 'Rank')

Now let's compare the different coefficients:

In [None]:
pearson_correlation,_ = pearsonr(top_250['Sales'],top_250['Rank'])
print('The Pearson correlation coefficient is ' + str(pearson_correlation))

In [None]:
spearman_correlation,_ = spearmanr(top_250['Sales'],top_250['Rank'])
print('The Spearman correlation coefficient is ' + str(spearman_correlation))

So it is clear that you should not drop a feature just because it has a low Pearson correlation.
Make sure to also check the Spearman coefficient, but only when:
1. There is a non-linear but monotonic relationship between 2 features, and 1 of them is an ordinal feature
2. The data is non-parametric (not assumed to come from a prescribed model that is determined by a small number of parameters)

Remember that Spearman considers the ranks of the features, so essentially it is calculating the Pearson coefficient on ranked data.
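
The equivalence between Spearman and "Pearson on ranks" can be checked directly. This is a small sketch on a made-up table; the column names `score` and `position` and their values are hypothetical and only serve the illustration.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Made-up values (not taken from the dataset); two rows are deliberately
# out of order so the correlation is not a perfect -1.
toy = pd.DataFrame({
    "score": [900, 400, 420, 120, 100, 95],
    "position": [1, 2, 3, 4, 5, 6],
})

# Step 1: replace each value by its rank within its own column.
score_ranks = toy["score"].rank()
position_ranks = toy["position"].rank()

# Step 2: a plain Pearson correlation on those ranks...
pearson_on_ranks, _ = pearsonr(score_ranks, position_ranks)

# ...should match what spearmanr reports on the raw values.
spearman_rho, _ = spearmanr(toy["score"], toy["position"])

print(pearson_on_ranks, spearman_rho)
```

Both numbers agree up to floating-point error, which is why outliers in the raw values have much less influence on Spearman than on Pearson.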

# Now that we have done correlation 101, it's time to get dirty with data!

First, I am going to drop the `Headquarters` and `Content` features, as they simply have too many missing values to be of any real use

In [None]:
top_250 = top_250.drop(columns=['Headquarters','Content'])
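
A decision like this can be backed up by looking at the share of missing values per column before dropping anything. The sketch below uses a tiny made-up frame, and the 50% `threshold` is an arbitrary assumption rather than a rule from the profiling report.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real data.
toy = pd.DataFrame({
    "Restaurant": ["A", "B", "C", "D"],
    "Content": [np.nan, "some text", np.nan, np.nan],
    "Headquarters": [np.nan, np.nan, "City, State", np.nan],
    "Sales": [100, 80, 60, 40],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: drop columns where more than half the values are missing.
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=sparse_columns)
print(reduced.columns.tolist())
```

On the real data, the same `isna().mean()` call on `top_250` (run before the drop) would show how sparse `Content` and `Headquarters` actually are.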

I am going to convert `YOY_Sales` and `YOY_Units` from percentage strings (such as `4.9%`) to floats so that we can discover insights about them

In [None]:
# Strip the '%' sign with the vectorised string accessor, then cast to float
top_250['YOY_Sales'] = top_250['YOY_Sales'].str.strip('%').astype(np.float64)
top_250['YOY_Units'] = top_250['YOY_Units'].str.strip('%').astype(np.float64)
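
If the percentage columns ever contain stray whitespace, missing entries, or placeholders, a direct cast to float raises an error. A more forgiving variant, sketched here on a made-up series (the values in `raw` are invented, including the `n/a` placeholder), is to let `pd.to_numeric` turn anything unparseable into `NaN`:

```python
import pandas as pd

# Made-up examples of what a messy percentage column might look like.
raw = pd.Series(["4.9%", "-2.0%", " 8.6% ", None, "n/a"])

# Trim whitespace, drop the trailing '%', then coerce bad entries to NaN
# instead of raising.
cleaned = pd.to_numeric(raw.str.strip().str.rstrip("%"), errors="coerce")

print(cleaned)
print("Unparseable or missing entries:", cleaned.isna().sum())
```

The trade-off is that bad entries are silently converted to `NaN`, so it is worth checking the count of missing values afterwards.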

Now, let's call the `plot_correlation` function from `dataprep.eda` to gain some valuable insights:

In [None]:
plot_correlation(top_250)

So we can see that `YOY_Sales` and `YOY_Units` are strongly correlated. Let's further examine this correlation:

In [None]:
# Column names are passed positionally, matching the other plot() calls in this notebook
plot(top_250, 'YOY_Sales', 'YOY_Units')

So here we can see a clear linear relationship between the two features. This could potentially be useful for the modelling process.

**What can we confidently say:**
1. There is a clear linear correlation between YOY_Sales and YOY_Units: restaurants whose sales grew over the year also tended to add more units. The correlation alone does not tell us which of the two drives the other.

# Analysing the categorical feature: Segment Category

Let's analyse the `Segment_Category` feature and see if we can gain useful insights from it as well:

In [None]:
plot(top_250,"Segment_Category")

Let's break down our observations:

1. We can see that the most popular segment category is a varied menu, with Mexican being the 2nd most frequent.
2. From the word cloud, we see that the words "casual", "service", "dining" and "quick" are the most frequent words found in the segment category description. This shows us that most businesses have a common target of providing customers with a quick and casual service.
3. Most restaurants used "casual" in their segment category description; also, note how "quick" and "service" have equal word frequencies; they probably appear together in the same description. Let us verify this:

In [None]:
top_250.loc[0]

Our assumption holds true!
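
Looking at a single row is a quick sanity check; a more systematic check would count how often the two words appear together across the whole column. The sketch below runs on a handful of illustrative labels in `segments`, which are assumed for the example and are not the full set of values in the dataset.

```python
import pandas as pd

# Illustrative labels only; on the real data this would be
# top_250['Segment_Category'].
segments = pd.Series([
    "Quick Service & Burger",
    "Quick Service & Coffee Cafe",
    "Casual Dining & Varied Menu",
    "Fast Casual & Mexican",
    "Quick Service & Pizza",
])

# Case-insensitive membership tests for each word.
has_quick = segments.str.contains("quick", case=False)
has_service = segments.str.contains("service", case=False)

# Share of "quick" labels that also contain "service".
both = (has_quick & has_service).sum()
co_occurrence_share = both / has_quick.sum()
print(f"{both} of {has_quick.sum()} 'quick' labels also contain 'service'")
```

A share of 1.0 on the real column would confirm that the two words always occur together.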

# Extended Analysis of numerical features

Let's begin by exploring the sales feature in more depth:

In [None]:
plot(top_250,'Sales')

Let's similarly break down our analysis:
1. Firstly, this feature is right-skewed: most restaurants have comparatively low sales, while a few have very high sales. This should be accounted for when modelling; a log or sqrt transformation may be required.
2. The normal Q-Q plot helps us assess whether a set of data plausibly came from some theoretical distribution, such as a normal or exponential distribution. Essentially, it helps us determine whether the data came from a normal distribution, and shows us if our data is skewed. So we can say that our feature is not normally distributed; it is indeed a skewed distribution with a heavy right tail.

To read more, follow [this link](https://data.library.virginia.edu/understanding-q-q-plots/)
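
The idea behind the Q-Q plot can also be expressed as a number. As a rough sketch, `scipy.stats.probplot` computes the Q-Q points and fits a straight line through them; the correlation `r` of that fit is close to 1 when the sample looks normal. Both samples below are synthetic and generated only for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumed for illustration): one normal, one heavily skewed.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# probplot returns the Q-Q points and a least-squares line (slope, intercept, r).
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# The closer r is to 1, the closer the points lie to the reference line.
print(f"r for the normal sample: {r_normal:.3f}")
print(f"r for the skewed sample: {r_skewed:.3f}")
```

The skewed sample bends away from the reference line, which shows up as a visibly lower `r`.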

In [None]:
plot(top_250,'Units')

Similar to Sales, we see a non-normal, right-skewed distribution for Units.
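
Skew can be quantified with the skewness coefficient, whose sign gives the direction of the long tail (positive for a long right tail, negative for a long left tail). The following sketch uses a synthetic log-normal series, `sales_like`, as a stand-in for a heavy-tailed feature; it is not the real Sales or Units column.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed data, assumed for illustration only.
rng = np.random.default_rng(42)
sales_like = pd.Series(rng.lognormal(mean=6.0, sigma=1.2, size=250))

# Skewness before any transformation.
skew_before = sales_like.skew()

# log1p computes log(1 + x), which also stays defined at x = 0.
log_sales = np.log1p(sales_like)
skew_after = log_sales.skew()

print(f"Skewness before log transform: {skew_before:.2f}")
print(f"Skewness after log transform:  {skew_after:.2f}")
```

Running `top_250['Sales'].skew()` and `top_250['Units'].skew()` would give the corresponding values for the real features.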

# Conclusion

After our analysis, let's take away some key insights:

1. We have both linear and non-linear relationships between key features.
2. We have data of different scales, so we should rescale features if using linear models.
3. We should also transform a few features, using log transformations and other transformations, to try to bring their distributions closer to a normal distribution.
4. We should also encode the Segment category to see if it positively benefits our model.
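
As a starting point for the rescaling and encoding steps, here is a minimal pandas-only sketch. The frame `toy` and its values are made up, and standardisation plus one-hot encoding are just one possible choice; other scalers or encoders may suit a given model better.

```python
import pandas as pd

# Made-up rows that mimic the shape of the real data.
toy = pd.DataFrame({
    "Sales": [100.0, 400.0, 250.0, 50.0],
    "Units": [10.0, 900.0, 300.0, 5.0],
    "Segment_Category": [
        "Quick Service & Burger",
        "Casual Dining & Varied Menu",
        "Quick Service & Burger",
        "Fast Casual & Mexican",
    ],
})

# Standardise numeric columns to zero mean and unit variance.
numeric_cols = ["Sales", "Units"]
scaled = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std(ddof=0)

# One-hot encode the categorical column: one indicator column per category.
dummies = pd.get_dummies(toy["Segment_Category"], prefix="Segment")

# Combine into a single feature matrix.
features = pd.concat([scaled, dummies], axis=1)
print(features.head())
```

Tree-based models such as CatBoost can often work with the raw categories and unscaled numbers, so these steps matter mostly for linear models.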