# DS-SF-34 | 02 | The `pandas` Library | Assignment | Starter Code

## The Bistro Meets `pandas`

You've just told one of your friends that you are taking a Data Science class.  (Yeah!)  Your friend runs a bistro, a small restaurant serving moderately priced, simple meals in a modest setting ([Wikipedia](https://en.wikipedia.org/wiki/Bistro)).  Over some period of time, she collected the following information about her patrons' visits.

| Variable's name | Its meaning |
|:---:|:---|
| `name` | Patron's first name |
| `gender` | Patron's gender |
| `is_smoker` | Whether the patron is a smoker |
| `party` | Party's size |
| `check` | Check amount (\$) (after taxes but before tips) |
| `tip` | Tip (\$) that the patron added to the check |
| `day` | Day of the week of the visit |
| `time` | Rough time estimate of the visit |

In this assignment, we will be exploring this dataset using `pandas`.<sup>(*)</sup>

<sup>(*)</sup> This dataset was adapted from the `tips` dataset of the `seaborn` package (https://github.com/mwaskom/seaborn-data).

> ### Question 1.  Import `numpy` (as `np`) and `pandas` (as `pd`).

In [1]:
import os

# TODO
import numpy as np
import pandas as pd

#pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

> ### Question 2.  Read the `dataset-02-bistro.csv` dataset.

In [2]:
# TODO
df = pd.read_csv('../datasets/dataset-02-bistro.csv')

> ### Question 3.  What is the class of the `pandas` object storing the dataset?

In [3]:
# TODO
type(df)

pandas.core.frame.DataFrame

Answer: DataFrame

> ### Question 4.  How many samples (i.e., rows) are in this dataset?

In [4]:
# TODO
len(df)

244

Answer: 244

> ### Question 5.  How many variables (i.e., columns) are in this dataset?

In [5]:
# TODO
df.shape[1]

8

Answer: 8
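Questions 4 and 5 can also be answered in one step, since `.shape` holds both counts. A minimal sketch, assuming a made-up three-row frame (the `sample` values below are hypothetical, not taken from the bistro file):

```python
import pandas as pd

# Hypothetical sample; the real dataset has 244 rows and 8 columns.
sample = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'party': [2, 3, 2],
    'check': [16.99, 10.34, 21.01],
})

# .shape is a (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
```

`len(sample)` and `len(sample.columns)` give the same two numbers separately.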

> ### Question 6.  Print the name of each column in the dataset, one name per line.

In [6]:
# TODO
for column in df.columns:
    print(column)

day
time
name
gender
is_smoker
party
check
tip

> ### Question 7.  Print the first two rows of the dataset to the console.  What does the output look like?

In [7]:
# TODO
df.head(2)

|  | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | Sunday | Dinner | Kimberly | Female | False | 2 | 16.99 | 1.01 |
| 1 | Sunday | Dinner | Nicholas | Male | False | 3 | 10.34 | 1.66 |


Answer: A 2 row DataFrame

> ### Question 8.  Extract the last 2 rows of the data frame and print them to the console.  What does the output look like?

In [8]:
# TODO
b = df.tail(2)
b

|  | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---|:---|:---|:---|:---|:---|:---|:---|
| 242 | Saturday | Dinner | Jon | Male | False | 2 | 17.82 | 1.75 |
| 243 | Thursday | Dinner | Brandi | Female | False | 2 | 18.78 | 3.0 |


Answer: A 2 row DataFrame

> ### Question 9.  Does the dataset contain any missing values?

In [9]:
# TODO
df.isnull().sum().sum()

0

Answer: No
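Summing `isnull()` twice gives one grand total. When a dataset does have gaps, the per-column counts are usually more informative. A small sketch on a made-up frame with two deliberately missing entries (all values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing name and one missing check.
sample = pd.DataFrame({
    'name': ['Ann', 'Bob', None],
    'check': [16.99, np.nan, 21.01],
    'tip': [1.01, 1.66, 3.50],
})

# One count per column instead of a single grand total.
missing_per_column = sample.isnull().sum()

# A plain yes/no answer to "is anything missing?".
has_missing = bool(sample.isnull().values.any())

print(missing_per_column)
print(has_missing)
```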

> ### Question 10.  What can you say about the `is_smoker` variable?  I.e., will it bring any insights when analyzing the dataset?  What do you want to do with it?  (and do it...)

In [10]:
# TODO
df.is_smoker.unique()
# All customers are non-smokers, so the column should be dropped

array([False], dtype=object)

In [11]:
df = df.drop('is_smoker', axis = 1)
df.columns

Index([u'day', u'time', u'name', u'gender', u'party', u'check', u'tip'], dtype='object')

Answer: Dropped the `is_smoker` column because its only value is `False` across all observations, so it carries no information.
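The same check can be run on every column at once rather than one variable at a time. A minimal sketch, assuming a made-up frame; `constant_columns` and `trimmed` are illustrative names, not part of the assignment:

```python
import pandas as pd

# Hypothetical sample in which `is_smoker` never varies.
sample = pd.DataFrame({
    'is_smoker': [False, False, False],
    'party': [2, 3, 2],
    'tip': [1.01, 1.66, 3.50],
})

# nunique() counts distinct values per column; a count of 1 means the column is constant.
n_unique = sample.nunique()
constant_columns = n_unique[n_unique <= 1].index.tolist()

# Drop every constant column in one call.
trimmed = sample.drop(columns=constant_columns)

print(constant_columns)
print(list(trimmed.columns))
```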

> ### Question 11.  For which week days does the dataset have data?

In [12]:
# TODO
df.day.unique()

array(['Sunday', 'Saturday', 'Thursday', 'Friday'], dtype=object)

Answer: Thursday, Friday, Saturday, Sunday

> ### Question 12.  How often was the bistro patronized for each week day?

(check `.value_counts()`; it could come in handy)

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)

In [13]:
# TODO
df.day.value_counts()

Saturday    87
Sunday      76
Thursday    62
Friday      19
Name: day, dtype: int64

Answer: see above
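`.value_counts()` sorts by frequency. Two variations may be useful: `normalize=True` returns shares instead of counts, and `.reindex()` puts the days back in calendar order. A sketch on a made-up Series (the `days` values and the `calendar_order` list are assumptions for illustration):

```python
import pandas as pd

# Hypothetical visit log.
days = pd.Series(['Saturday', 'Sunday', 'Saturday', 'Thursday', 'Friday', 'Saturday'])

counts = days.value_counts()                # visits per day, most frequent first
shares = days.value_counts(normalize=True)  # the same, as a fraction of all visits

# Reorder the counts by calendar day rather than by frequency.
calendar_order = ['Thursday', 'Friday', 'Saturday', 'Sunday']
counts_by_calendar = counts.reindex(calendar_order)

print(counts_by_calendar)
print(shares)
```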

> ### Question 13.  How much tip did waiters collect for each week day?

In [14]:
# TODO
df[['day','tip']].groupby('day').sum()

| day | tip |
|:---:|:---|
| Friday | 51.96 |
| Saturday | 260.4 |
| Sunday | 247.39 |
| Thursday | 171.83 |


Answer: see above
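Selecting the column after grouping and passing several function names to `.agg()` returns the total, the average, and the number of checks side by side. A minimal sketch on made-up numbers (the `sample` frame is hypothetical):

```python
import pandas as pd

# Hypothetical tips for two days.
sample = pd.DataFrame({
    'day': ['Friday', 'Friday', 'Saturday', 'Saturday', 'Saturday'],
    'tip': [1.0, 3.0, 2.0, 2.5, 4.5],
})

# One row per day, one column per aggregate.
per_day = sample.groupby('day')['tip'].agg(['sum', 'mean', 'count'])
print(per_day)
```

For each day, `sum` equals `mean` times `count`, which is a quick sanity check on grouped results.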

> ### Question 14.  What is the average tip per check (in absolute \$) for each week day?

In [15]:
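# TODO
df.groupby('day')['tip'].mean()

day
Friday      2.734737
Saturday    2.993103
Sunday      3.255132
Thursday    2.771452
Name: tip, dtype: float64

Answer: about \$2.73 on Friday, \$2.99 on Saturday, \$3.26 on Sunday, and \$2.77 on Thursday (about \$3.00 across all days)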

> ### Question 15.  What is the average tip per check (as a percentage of the check) for each week day?

In [16]:
# TODO
y = df['tip'] / df['check']
print(round(y.mean() * 100, 2))

16.08


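Answer: 16.1% averaged over all checks (the cell above computes the overall figure and does not split it by week day)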
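A per-day version needs a `groupby('day')`, and there are two reasonable readings of "average tip percentage": the mean of each check's own percentage, or the day's total tips over its total checks. The sketch below computes both on made-up numbers (the `sample` frame and the `tip_pct` column are hypothetical) to show that they can differ:

```python
import pandas as pd

# Hypothetical checks and tips for two days.
sample = pd.DataFrame({
    'day': ['Friday', 'Friday', 'Saturday', 'Saturday'],
    'check': [10.0, 40.0, 20.0, 20.0],
    'tip': [2.0, 4.0, 3.0, 5.0],
})

# Reading 1: average of the per-check percentages.
sample['tip_pct'] = 100 * sample['tip'] / sample['check']
mean_of_ratios = sample.groupby('day')['tip_pct'].mean()

# Reading 2: total tips divided by total checks for the day.
totals = sample.groupby('day')[['tip', 'check']].sum()
ratio_of_sums = 100 * totals['tip'] / totals['check']

print(mean_of_ratios)
print(ratio_of_sums)
```

With these numbers Friday comes out at 15% under the first reading and 12% under the second, because the large check carries more weight in the ratio of sums.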

> ### Question 16.  Are there any names in common between male and female patrons?  (E.g., `Chris` can refer to either a man or a woman)

(check `numpy.intersect1d()`; it could come in handy)

(https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.html)

In [17]:
df.gender.unique()

array(['Female', 'Male'], dtype=object)

In [18]:
# TODO
m = df[df.gender == 'Male']
f = df[df.gender == 'Female']
np.intersect1d(f.name,m.name)

array(['Casey'], dtype=object)

Answer: Casey
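For reference, here is the same intersection on a made-up frame, next to a plain Python `set` version that gives the same names. The `sample` data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical patrons; 'Casey' appears under both genders.
sample = pd.DataFrame({
    'name': ['Casey', 'Mary', 'Casey', 'David', 'Mary'],
    'gender': ['Female', 'Female', 'Male', 'Male', 'Female'],
})

female_names = sample.loc[sample['gender'] == 'Female', 'name']
male_names = sample.loc[sample['gender'] == 'Male', 'name']

# intersect1d returns the sorted, de-duplicated values present in both inputs.
shared = np.intersect1d(female_names, male_names)

# Equivalent result with built-in sets.
shared_via_sets = sorted(set(female_names) & set(male_names))

print(shared)
print(shared_via_sets)
```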

> ### Question 17.  If no patrons share the same name, how many unique patrons are in the dataset?

In [19]:
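# TODO
# NOTE: drop() treats its argument as index labels, not as a boolean mask, so this
# call does not filter out the 'Casey' rows; df[df.name != 'Casey'] would.
c = df.drop(df.name == 'Casey', axis = 0)
len(c.name.unique())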

179

Answer: 179 unique names excluding the 2 Caseys
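Two building blocks are relevant here: filtering rows with a boolean mask, and counting distinct values. The sketch below shows both on a made-up frame. Counting distinct `(name, gender)` pairs is one possible reading of the question (an assumption, not something the assignment states), under which the two Caseys count as two patrons:

```python
import pandas as pd

# Hypothetical patrons; 'Casey' is both a female and a male patron.
sample = pd.DataFrame({
    'name': ['Casey', 'Mary', 'Casey', 'David', 'Mary'],
    'gender': ['Female', 'Female', 'Male', 'Male', 'Female'],
})

# Row filtering uses a boolean mask inside the brackets.
without_casey = sample[sample['name'] != 'Casey']

# Distinct names, ignoring gender.
unique_names = sample['name'].nunique()

# Distinct (name, gender) pairs.
unique_patrons = len(sample[['name', 'gender']].drop_duplicates())

print(without_casey)
print(unique_names, unique_patrons)
```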

> ### Question 18.  How many times did `Kevin` patronize the bistro?  How about `Alice`?

In [20]:
# TODO
df[df.name.isin(['Kevin', 'Alice'])].groupby('name').size()

name
Alice    2
Kevin    4
dtype: int64

Answer: Alice 2; Kevin 4
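Another way to look up a few names is to count every name once with `.value_counts()` and then pick the ones of interest with `.reindex()`, which also reports 0 for a name that never appears. A sketch on made-up data (the `names` Series and the name `Zoe` are hypothetical):

```python
import pandas as pd

# Hypothetical visit log.
names = pd.Series(['Kevin', 'Alice', 'Kevin', 'Bob', 'Kevin'])

# Count all names, then keep only the requested ones; absent names get 0.
visits = names.value_counts().reindex(['Kevin', 'Alice', 'Zoe'], fill_value=0)
print(visits)
```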

> ### Question 19.  Who are the top 3 female and male patrons?

In [52]:
# TODO
df_1 = df.groupby(['gender','name']).size().reset_index()
df_1.columns = ['gender','name','visits']
print(df_1[df_1.gender == "Male"].sort_values('visits', ascending=False).head(3))
print(df_1[df_1.gender == "Female"].sort_values('visits', ascending=False).head(3))

    gender   name  visits
100   Male  David       8
120   Male  James       5
89    Male  Casey       5
    gender   name  visits
45  Female   Mary       4
36  Female  Laura       3
8   Female  Casey       3


Answer: Female: Mary, Laura, and Casey; Male: David, James, Casey
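The two per-gender filters can be folded into one expression: sort once by visit count, then take the first rows of each gender group with `groupby(...).head(n)`. A minimal sketch on made-up visits, using `n = 1` to keep the toy example free of ties (the `sample` frame is hypothetical):

```python
import pandas as pd

# Hypothetical visit log.
sample = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'name': ['David', 'David', 'James', 'Mary', 'Mary', 'David', 'Laura'],
})

# Visits per (gender, name); reset_index(name=...) names the count column directly.
visits = sample.groupby(['gender', 'name']).size().reset_index(name='visits')

# Sort by visits, then keep the leading row of each gender.
top = (visits.sort_values('visits', ascending=False)
             .groupby('gender')
             .head(1))
print(top)
```

Note that ties (such as James and Casey at 5 visits each in the output above) are broken by sort order, so the third place is not uniquely defined.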

> ### Question 20.  Who's the best tipper (as a fraction of all tips over all check totals)?  Who's the worst?  How many times did they patronize the bistro?

(check `.sort_values()`; it could come in handy)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html)

In [84]:
# Assigning this name-indexed Series as a column aligns it on df's integer index,
# which leaves `visits` as NaN:
#df['visits'] = df.groupby('name').size()
g = df[['name','check','tip','visits']].groupby(['name']).sum().reset_index()
g['tip_perc'] = g.tip/g.check
print(g.sort_values('tip_perc', ascending=False).head(1))
print(g.sort_values('tip_perc', ascending=True).head(1))

        name  check  tip  visits  tip_perc
122  Maryann    9.6    4     NaN  0.416667
      name  check   tip  visits  tip_perc
80  Jeremy  32.83  1.17     NaN  0.035638


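Answer: Maryann is the best tipper at 42% of her check total; Jeremy is the worst at 4%. The `visits` column is `NaN` in the output above, so the visit counts cannot be read from this cell.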
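One way to get valid visit counts is to build the per-patron table from a single `groupby('name')`, so that the sums and the counts share the same name index and line up without `NaN`. A sketch on made-up data (the `sample` frame and the `tip_fraction` column name are hypothetical):

```python
import pandas as pd

# Hypothetical visits: Ann came twice, Bob and Cy once each.
sample = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob', 'Cy'],
    'check': [10.0, 30.0, 20.0, 10.0],
    'tip': [2.0, 4.0, 1.0, 4.0],
})

grouped = sample.groupby('name')

# Totals per patron, indexed by name.
per_patron = grouped[['check', 'tip']].sum()

# size() is also indexed by name, so the assignment aligns row by row.
per_patron['visits'] = grouped.size()

# All tips over all check totals, per patron.
per_patron['tip_fraction'] = per_patron['tip'] / per_patron['check']

best = per_patron['tip_fraction'].idxmax()
worst = per_patron['tip_fraction'].idxmin()

print(per_patron.loc[[best, worst]])
```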