# Exploring Restaurants on Yellow Pages

The Yellow Pages is a long-running directory of services that has been published in one form or another since the late 19th century. These phone directories are traditionally published as large books filled with thin yellow sheets of paper, hence the name.

Each page would list hundreds of phone numbers for lawyers, doctors, dentists, handymen, and ordinary people. This dataset is a subset of Yellow Pages listings for restaurants in the United States. It is taken from The Real Yellow Pages, a phone directory publisher (one of several using the name "Yellow Pages") formerly owned by AT&T. Today these services have fallen by the wayside in favor of applications like Yelp, but phone directories nevertheless march on.

In this notebook we will explore the Yellow Pages restaurants dataset. We'll look at what categories of restaurants are popular in the United States and, briefly, what restaurant domains on the web look like. We will probe the basic dataset attributes and hopefully uncover some interesting effects from the data! This exploratory data analytics notebook is especially recommended for beginners. Feel free to fork this notebook and/or copy the code here and explore further on your own!

# Data Munging

In [None]:
import pandas as pd
restaurants = pd.read_csv("../input/yellowpages_com-restaurant_sample (1).csv")

In [None]:
restaurants.head(3)

The `Categories` field in this dataset concatenates a wealth of different restaurant categories into a comma-separated list. We should "unroll" this data into a set of Yes/No `bool` fields so that we can manipulate restaurant categories more easily.

In [None]:
import numpy as np
from itertools import chain

def catmap(cats):
    """Split a comma-separated category string into a list of clean names."""
    if pd.isnull(cats):
        return [np.nan]
    return [cat.strip() for cat in cats.split(",")]

# Generate the by-entry categories list.
cat_lists = restaurants['Categories'].map(catmap)

# Get the set of possible categories.
categories = set(chain.from_iterable(cat_lists))

In [None]:
len(categories)

There are a lot of categories! We will give each of these categories its own field in the dataset.

In [None]:
from tqdm import tqdm

cat_lists = restaurants['Categories'].map(catmap)

for cat in tqdm(categories):
    restaurants[cat] = cat_lists.map(lambda cats: cat in cats)

This greatly increases the dimensionality of our dataset.

In [None]:
restaurants.head(3).shape
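
As an aside, the same unrolling can be sketched without an explicit loop. The snippet below is a minimal illustration on a made-up `toy` Series standing in for the `Categories` column (the values are invented, not taken from the dataset); it is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for the real `Categories` column (values are made up).
toy = pd.Series(["Pizza, Italian Restaurants", "Coffee Shops", None, "Pizza"])

# One row per (restaurant, category) pair, with stray whitespace removed.
exploded = toy.str.split(",").explode().str.strip()

# One indicator column per category, then collapse back to one row per restaurant.
# A restaurant with no categories ends up with False everywhere.
flags = pd.get_dummies(exploded).groupby(level=0).max().astype(bool)

print(flags)
```

Unlike the loop above, this sketch leaves entries with missing categories as all-`False` rows instead of giving them their own column.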

## What kinds of restaurants are most popular in the United States?*

[*] w.r.t. the sample provided, which is biased towards the Eastern Seaboard.

In [None]:
# Recent pandas versions reject a set as a column indexer, so pass a list.
category_counts = restaurants.loc[:, list(categories)].sum().sort_values(ascending=False)
category_counts = category_counts.drop('Restaurants')  # every entry is a restaurant

The twenty most common restaurant food categories are informative. The chart below is, approximately, what the average American taste in food looks like.

In [None]:
import seaborn as sns

In [None]:
import matplotlib.pyplot as plt
category_counts.head(20).plot.bar(title='Top 20 Most Common Restaurant Food Categories',
                                  figsize=(14, 7))
plt.gca().set_xticklabels(plt.gca().get_xticklabels(), rotation=45, ha='right', fontsize=14)
pass

Note that categories are *not exclusive*: a coffee shop that also serves breakfast will be given both categories. Most restaurants fall into more than one category: two is the most common count, but the average is more than three.
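
To see how the most common count and the average can differ, here is a small sketch on an invented table of category flags (the column names and values are illustrative assumptions, not rows from the dataset).

```python
import pandas as pd

# Made-up flags: one row per restaurant, one bool column per category.
flags = pd.DataFrame({
    "Coffee Shops": [True, True, False, False],
    "Breakfast":    [True, False, False, True],
    "Pizza":        [False, False, True, True],
    "Italian":      [False, False, True, True],
    "Bakeries":     [False, True, False, True],
})

# Number of categories assigned to each restaurant.
per_restaurant = flags.sum(axis=1)

most_common = per_restaurant.mode().iloc[0]  # the typical count
average = per_restaurant.mean()              # pulled upward by heavily tagged rows

print(most_common, average)
```

A single restaurant with many tags is enough to lift the mean above the mode, which is the same pattern described above.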

In [None]:
# Exclude the catch-all 'Restaurants' category, then count categories per row.
restaurants.loc[:, list(categories - {'Restaurants'})].sum(axis=1).value_counts().sort_index().plot.bar(
    title='Restaurants by Number of Yellowpages Categories Assigned', figsize=(14, 7),
    fontsize=16
)

In [None]:
"Restaurants are in {:.2f} categories on average".format(
    restaurants.loc[:, list(categories - {'Restaurants'})].sum(axis=1).mean()
)

With 3.2 categories on average, and >250 categories overall, the classification is still extremely sparse. We can use the `missingno` package as a tool to visualize *how* sparse:

In [None]:
import missingno as msno
msno.matrix(restaurants.loc[:, list(categories - {'Restaurants'})].replace(False, np.nan).head(500))
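
The sparsity can also be put as a single number: the share of category cells that are `True`. The sketch below simply divides the two figures quoted above (treating 250 as a lower bound on the category count, so the result is an upper bound), and then measures the same quantity on a made-up table of flags.

```python
import pandas as pd

# Figures quoted in the text above; 250 is a lower bound on the category count.
avg_categories = 3.2
n_categories = 250
density_upper_bound = avg_categories / n_categories  # share of True cells, at most

# The same quantity measured directly on an invented table of flags.
toy_flags = pd.DataFrame({
    "Pizza":        [True, False, False, False],
    "Bakeries":     [False, False, True, False],
    "Coffee Shops": [False, False, False, False],
})
toy_density = toy_flags.to_numpy().mean()

print("{:.2%} (upper bound), toy table: {:.2%}".format(density_upper_bound, toy_density))
```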

Note that the long streaks in the dataset nullity values are evidence of some kind of artifact in the way that this data was collected by PromptCloud. If we can't explain where the "streakiness" comes from, it may be a concern when it comes time to, say, build an ML model on this data.

We can build a dendrogram to see what categories go together.

In the chart below, food categories that often go together are located close to one another. The smaller the distance between the "splits" in the levels on the chart, the more likely restaurants with either of those categorizations are to have both categorizations. So for example, we see that (unsurprisingly) "Chinese Restaurants" are almost always also "Asian Restaurants".

In [None]:
display_categories = set(category_counts[category_counts > 50].index) - {'Restaurants'}

In [None]:
msno.dendrogram(
    restaurants.loc[:, list(display_categories)].replace(False, np.nan),
    orientation='left',
    figsize=(9, 14)
)
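
A single pairing read off the dendrogram can be checked directly by asking what fraction of restaurants tagged with one category also carry another. The sketch below uses an invented table and a hypothetical helper, `share_also`, purely for illustration.

```python
import pandas as pd

# Made-up flags; the column names mirror the categories discussed in the text.
toy = pd.DataFrame({
    "Chinese Restaurants": [True, True, True, False, False],
    "Asian Restaurants":   [True, True, True, True, False],
    "Pizza":               [False, False, False, False, True],
})

def share_also(df, given, other):
    """Fraction of rows tagged `given` that are also tagged `other`."""
    return df.loc[df[given], other].mean()

chinese_also_asian = share_also(toy, "Chinese Restaurants", "Asian Restaurants")
asian_also_chinese = share_also(toy, "Asian Restaurants", "Chinese Restaurants")

print(chinese_also_asian, asian_also_chinese)
```

The relationship need not be symmetric: in this toy table every Chinese restaurant is also Asian, but not every Asian restaurant is Chinese.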

## Where are what kinds of restaurants popular?

Next, let's try and break restaurant categories down by location.

In [None]:
import seaborn as sns
restaurants['State'].value_counts().plot.bar(title='Restaurants by State', figsize=(14, 7), 
                                             fontsize=14)

The number of restaurants sampled by state is weird. These states are all on the Eastern side of the country, for one. For another, we don't have a proportionately representative sample: the number of restaurants in Georgia is definitely not one-twentieth of the restaurants in Indiana, for example!

Oftentimes this sort of odd distribution is again evidence of an artifact in the data collection.

Let's press on. Let's take a peek at differences in the kinds of restaurants available in the different states. We'll take the states with the three largest sample sizes in the data, and break things down into percentages: "restaurants per capita" with respect to the population of restaurants, if you will.

In [None]:
# Share of each state's sampled restaurants in each category (.loc needs a list, not a set).
food_categories = list(categories - {'Restaurants'})
indiana_tot = restaurants.query('State == "IN"').loc[:, food_categories].mean()
florida_tot = restaurants.query('State == "FL"').loc[:, food_categories].mean()
penn_tot = restaurants.query('State == "PA"').loc[:, food_categories].mean()
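
The same per-state shares can be sketched for every state at once with a `groupby`. The table below is invented for illustration (it is not a sample of the real data), and the column names are assumptions.

```python
import pandas as pd

# Made-up rows: a state code plus two bool category flags.
toy = pd.DataFrame({
    "State":     ["IN", "IN", "FL", "FL", "FL", "PA"],
    "Fast Food": [True, True, True, False, False, False],
    "Bakeries":  [False, True, False, False, True, True],
})

# Mean of a bool column within a group = share of that state's restaurants.
shares = toy.groupby("State")[["Fast Food", "Bakeries"]].mean()

# Positive values: relatively more common in Indiana than in Florida.
diff = shares.loc["IN"] - shares.loc["FL"]

print(shares)
print(diff)
```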

In [None]:
(indiana_tot - florida_tot)[
    # Index by restaurant types with the largest difference per capita
    (indiana_tot - florida_tot).abs().sort_values(ascending=False).head(5).index
].sort_values().plot.bar(
    title='Restaurants More Popular in Indiana than Florida, Largest Differences', 
    figsize=(14, 7), fontsize=14
)

In the chart above, restaurant types that are more popular in Indiana than in Florida are positive, while those more popular in Florida than in Indiana are negative. Indiana has more fast food restaurants per capita! These differences are fairly small in absolute magnitude, however, only making up plus-minus 0.3% of the total number of restaurants.

In [None]:
(penn_tot - indiana_tot)[
    # Index by restaurant types with the largest difference per capita
    (penn_tot - indiana_tot).abs().sort_values(ascending=False).head(5).index
].sort_values().plot.bar(
    title='Restaurants More Popular in Pennsylvania than Indiana, Largest Differences', 
    figsize=(14, 7), fontsize=14
)

Pennsylvania, meanwhile, has more bakeries and coffee shops than Indiana does.

## What do restaurant website and email domains look like?

This dataset includes a text field for emails. There are a couple of interesting things we can extract from this.

In [None]:
restaurants['Email'][restaurants['Email'].notnull()].head()

First of all, what domain endings are usually used in these email addresses? Not surprisingly, dot-com wins by a lot.

In [None]:
(restaurants['Email'][restaurants['Email'].notnull()]
     .str.split(".")
     .map(lambda d: d[-1])
     .str.lower()
     .value_counts()
     .plot.bar(figsize=(14, 7), fontsize=14, title='Domain Name Endings Used by Restaurants'))
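
To make the extraction step concrete, here is a minimal sketch on a few invented addresses (all under reserved `example` domains, not taken from the dataset). It splits only on the last dot, which gives the same ending as the chain above.

```python
import pandas as pd

# Made-up email addresses, including a missing value and mixed case.
emails = pd.Series([
    "info@example.com",
    "Owner@Example.ORG",
    None,
    "chef@bistro.example.net",
])

# Keep only the text after the final dot, lowercased.
tlds = emails.dropna().str.rsplit(".", n=1).str[-1].str.lower()
counts = tlds.value_counts()

print(counts)
```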

Second, what starter words (the part before the `@`) are used? Some of these, like `ddkidzone`, are specific to particular chains (Dunkin' Donuts, in this case). Still, it's interesting to see that `info` is the overall winner by a wide margin.

In [None]:
(restaurants['Email'][restaurants['Email'].notnull()]
     .str.split("@")
     .map(lambda d: d[0])
     .str.lower()
     .value_counts()
     .head(10)
     .plot.bar(figsize=(14, 7), fontsize=14, title='Top 10 Email Starter Words Used by Restaurants'))
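
Splitting each address into both halves at once makes it easier to check whether a starter word belongs to a single domain. This is a sketch on invented addresses; the column names `local` and `domain` are labels chosen here, not fields in the dataset.

```python
import pandas as pd

# Made-up email addresses under reserved example domains.
emails = pd.Series([
    "info@example.com",
    "INFO@example.org",
    "orders@example.net",
    None,
])

# Split once on "@" into two columns: the starter word and the domain.
parts = emails.dropna().str.lower().str.split("@", n=1, expand=True)
parts.columns = ["local", "domain"]

# Most common starter words, and how many distinct domains use each one.
top = parts["local"].value_counts()
domains_per_local = parts.groupby("local")["domain"].nunique()

print(top)
print(domains_per_local)
```

A starter word that appears often but under only one domain is likely specific to a single chain.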

## Further ideas

That's all, folks!

To explore further, try digging deeper into the `Categories` field.