<h1 style="color: #fcd805">Exercise: Exploratory Data Analysis with pandas</h1>

For the `pandas` exercises, you will gradually explore a new dataset of Kickstarter projects.

Kickstarter is a site that lets you crowdfund your project ideas. The dataset shows information about such projects including whether they succeeded or failed.

1. Read the file `kickstarter.csv.gz` from the `data` folder into a `pandas` `DataFrame` and inspect the data with the `.head` method.

Note: the `.gz` ending indicates this is a *gzip-compressed* CSV file. Compression greatly reduces the file size without losing any data, and the file can be read exactly like a plain CSV file: `pandas` infers the compression from the file extension and handles it for you.

In [1]:
import pandas as pd

kickstarter = pd.read_csv("./data/kickstarter.csv.gz")

kickstarter.head()

Unnamed: 0,ID,name,subcategory,category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0
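
The note about compressed files can be checked on a small scale. The sketch below writes a tiny gzip-compressed CSV to a temporary folder and reads it back with a plain `pd.read_csv` call. The file name `sample.csv.gz` and its three rows are made up for illustration and are not part of the Kickstarter data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical three-row sample; the real exercise file is much larger.
csv_text = "ID,name,goal\n1,Alpha,1000.0\n2,Beta,250.5\n3,Gamma,30000.0\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv.gz")
    # Write the text as a gzip-compressed file
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(csv_text)
    # No extra arguments: compression is inferred from the ".gz" suffix
    sample = pd.read_csv(path)

print(sample)
```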


2. How many rows and columns are there?

In [2]:
kickstarter.shape

(378661, 13)

3. Check the data type of each column. Do any of them look incorrect?

In [3]:
kickstarter.dtypes

ID               int64
name            object
subcategory     object
category        object
currency        object
deadline        object
goal           float64
launched        object
pledged        float64
state           object
backers          int64
country         object
usd pledged    float64
dtype: object

_Mostly fine. Text columns load as `object` by default, and the two date columns (`deadline` and `launched`) are not parsed as dates automatically, so they also show up as `object`._
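
If the date columns are needed later, one option is `pd.to_datetime` (or the `parse_dates` argument of `read_csv`). This is a minimal sketch on a made-up two-row frame that mirrors the `deadline` and `launched` columns; the `campaign_days` column is a hypothetical name used only here.

```python
import pandas as pd

# Hypothetical sample mirroring the two date columns in the dataset
sample = pd.DataFrame({
    "deadline": ["2015-10-09", "2017-11-01"],
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57"],
})

# Convert each text column to a datetime column
for col in ["deadline", "launched"]:
    sample[col] = pd.to_datetime(sample[col])

# Datetime columns support date arithmetic, e.g. whole days between launch and deadline
sample["campaign_days"] = (sample["deadline"] - sample["launched"]).dt.days

print(sample.dtypes)
```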

4. Are there any missing values? If so, what should be done about them?

In [4]:
kickstarter.isnull().sum()

ID                0
name              4
subcategory       0
category          0
currency          0
deadline          0
goal              0
launched          0
pledged           0
state             0
backers           0
country           0
usd pledged    3797
dtype: int64

_Four names are missing, as are 3,797 values in `usd pledged` (about 1% of rows). Whether to keep or drop the four rows without a name depends on whether the analysis relies on project names. Likewise, dropping rows with a missing `usd pledged` only makes sense if that column matters for the analysis._
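
Both ideas, measuring the share of missing values and dropping rows only when a specific column is missing, can be sketched on a small made-up frame. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two columns
sample = pd.DataFrame({
    "name": ["Alpha", None, "Gamma", "Delta"],
    "usd pledged": [10.0, 20.0, np.nan, np.nan],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = sample.isnull().mean()

# Drop rows only where `name` is missing; gaps in other columns are kept
named_only = sample.dropna(subset=["name"])

print(missing_share)
print(named_only)
```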

5. Create a new column for the share of the goal that was achieved: the amount pledged divided by the goal. The solution below stores this as a fraction (so `1.0` means the goal was exactly met); multiply by 100 if you want a percentage.

In [5]:
kickstarter["pct_goal_achieved"] = kickstarter["pledged"] / kickstarter["goal"]

kickstarter.head()

Unnamed: 0,ID,name,subcategory,category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,pct_goal_achieved
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,0.0807
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,0.004889
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,0.0002
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,0.065795


6. Drop the `usd pledged` column as it has some incorrect values in it.

In [6]:
kickstarter = kickstarter.drop(columns=["usd pledged"])

7. Convert the `name` column to the `string` type.

In [7]:
kickstarter["name"] = kickstarter["name"].astype("string")

8. How many categories are there in the data, and what are they?

In [8]:
kickstarter["category"].nunique()

15

In [9]:
kickstarter["category"].unique()

array(['Publishing', 'Film & Video', 'Music', 'Food', 'Design', 'Crafts',
       'Games', 'Comics', 'Fashion', 'Theater', 'Art', 'Photography',
       'Technology', 'Dance', 'Journalism'], dtype=object)

<h1 style="color: #fcd805">Exercise: Filtering and Descriptive Statistics</h1>

We're going to continue working on the Kickstarter data from the previous exercise.

1. How many projects are in the Music category?

In [10]:
music = kickstarter[kickstarter["category"] == "Music"]
len(music)

51918

2. How many projects in the Music category *succeeded*?

In [11]:
music_success = music[music["state"] == "successful"]
len(music_success)

24197

3. How many projects in the Music category have "song" in their name (ignoring case)?

In [12]:
music_songs = music[music["name"].str.lower().str.contains("song")]
len(music_songs)

1612
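
An alternative to lowercasing first is to pass `case=False` to `.str.contains`. Adding `na=False` makes missing names count as "no match" explicitly. A minimal sketch on invented names:

```python
import pandas as pd

# Hypothetical names, including one missing value
names = pd.Series(["Love Song", "SONGBOOK tour", "Jazz album", None], dtype="string")

# case=False ignores capitalisation; na=False treats missing names as non-matches
mask = names.str.contains("song", case=False, na=False)
matches = names[mask]

print(matches)
```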

4. How many projects are in the Music and Film & Video categories in total?

In [13]:
music_and_film = kickstarter[(kickstarter["category"] == "Music") | (kickstarter["category"] == "Film & Video")]
len(music_and_film)

115503

_We could also use `.isin()` to see if the category is in a certain list of values_

In [14]:
len(kickstarter[kickstarter["category"].isin(["Music", "Film & Video"])])

115503

5. What are the smallest and biggest goals in the dataset?

In [15]:
kickstarter.sort_values(by="goal", ascending=True).head()

Unnamed: 0,ID,name,subcategory,category,currency,deadline,goal,launched,pledged,state,backers,country,pct_goal_achieved
304489,620302213,LOVELAND Round 6: A Force More Powerful,Conceptual Art,Art,USD,2009-12-04,0.01,2009-11-25 07:54:49,100.0,successful,6,US,10000.0
317771,688564643,"Word-of-mouth publishing: get ""Corruptions"" ou...",Fiction,Publishing,USD,2011-12-13,0.01,2011-11-07 16:46:52,0.0,canceled,0,US,0.0
370401,9572984,Nana,Shorts,Film & Video,USD,2012-03-16,0.15,2012-01-25 07:23:19,0.0,failed,0,US,0.0
226171,219760504,RocknRoll NoisePollution,Documentary,Film & Video,USD,2011-07-19,0.5,2011-07-12 15:59:39,0.0,failed,0,US,0.0
12236,1061341578,???? - Bulgaria Songbook,Music,Music,USD,2015-11-23,1.0,2015-10-22 17:47:51,45.0,successful,4,US,45.0


In [16]:
kickstarter.sort_values(by="goal", ascending=False).head()

Unnamed: 0,ID,name,subcategory,category,currency,deadline,goal,launched,pledged,state,backers,country,pct_goal_achieved
348139,843636303,"The Exodus, one Ark or many.",Documentary,Film & Video,USD,2016-08-16,100000000.0,2016-06-17 17:20:52,14.0,failed,5,US,1.4e-07
367928,944541075,Hydroponic's Skyscraper(un gratte-ciel hydropo...,Technology,Technology,EUR,2015-10-24,100000000.0,2015-08-25 23:52:30,2.0,failed,2,FR,2e-08
118402,1601563193,Our future,Space Exploration,Technology,AUD,2014-10-07,100000000.0,2014-08-08 19:05:39,1.0,failed,1,AU,1e-08
300797,601594365,The Multi-Trillion Dollar Dreamâ¢ (Canceled),Architecture,Design,USD,2015-03-02,100000000.0,2015-01-07 19:11:05,1.0,canceled,1,US,1e-08
55009,1279992058,Kybernan Holographic Gaming Network,Video Games,Games,USD,2016-01-01,100000000.0,2015-11-07 00:57:17,13.0,failed,4,US,1.3e-07
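
If only the two extreme values are needed rather than the full rows, `.agg` can return both at once, and `.nsmallest` picks the rows with the lowest values directly. The sketch uses an invented four-row frame.

```python
import pandas as pd

# Hypothetical sample of project goals
sample = pd.DataFrame({
    "name": ["Tiny", "Album", "Moonshot", "Zine"],
    "goal": [0.01, 5000.0, 100000000.0, 250.0],
})

# Smallest and biggest goal in one call
goal_range = sample["goal"].agg(["min", "max"])

# The two rows with the smallest goals
cheapest = sample.nsmallest(2, "goal")

print(goal_range)
print(cheapest)
```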


6. What is the average number of backers a project received?

In [17]:
print(kickstarter["backers"].mean(), kickstarter["backers"].median())

105.61747578969052 12.0


7. What is the average number of backers that *successful* projects received?

_Hint: Think about the order of operations of how to answer this. What do you need to do first?_

In [18]:
success = kickstarter[kickstarter["state"] == "successful"]
print(success["backers"].mean(), success["backers"].median())

263.92136223834694 71.0


_You can also use `.loc` for this to do both filtering and selecting a single column_

In [19]:
kickstarter.loc[kickstarter["state"] == "successful", "backers"].mean()

263.92136223834694

<h1 style="color: #fcd805">Exercise: Sorting and Aggregating</h1>

Back to the Kickstarter data.

1. What is the **total** amount pledged *by category*?

You want to end up with one row per category, showing the total pledged amount for each category.

In [20]:
kickstarter.groupby("category")["pledged"].sum()

category
Art             1.015470e+08
Comics          7.464365e+07
Crafts          1.776030e+07
Dance           1.390693e+07
Design          8.154909e+08
Fashion         1.494227e+08
Film & Video    4.045744e+08
Food            1.313787e+08
Games           7.703319e+08
Journalism      1.530200e+07
Music           2.072948e+08
Photography     3.950123e+07
Publishing      1.450902e+08
Technology      7.356088e+08
Theater         4.471301e+07
Name: pledged, dtype: float64

_There are a few ways to avoid scientific notation. One is to round to 0 decimal places; the other options involve changing global `pandas` display settings._

In [21]:
kickstarter.groupby("category")["pledged"].sum().round()

category
Art             101547028.0
Comics           74643648.0
Crafts           17760300.0
Dance            13906929.0
Design          815490921.0
Fashion         149422710.0
Film & Video    404574432.0
Food            131378697.0
Games           770331916.0
Journalism       15301995.0
Music           207294847.0
Photography      39501225.0
Publishing      145090177.0
Technology      735608802.0
Theater          44713013.0
Name: pledged, dtype: float64
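
For the display-settings route, `pd.option_context` applies a setting only inside a `with` block, so the global configuration is left untouched afterwards. This is a sketch with two invented totals; the format string `"{:,.0f}"` is just one possible choice.

```python
import pandas as pd

# Hypothetical totals large enough to be worth formatting
totals = pd.Series([815490921.37, 13906929.0],
                   index=["Design", "Dance"], name="pledged")

# Temporarily format floats with thousands separators and no decimals
with pd.option_context("display.float_format", "{:,.0f}".format):
    formatted = totals.to_string()

print(formatted)
```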

_Let's also sort this to better see the ordering_

In [22]:
kickstarter.groupby("category")["pledged"].sum().round().sort_values(ascending=False)

category
Design          815490921.0
Games           770331916.0
Technology      735608802.0
Film & Video    404574432.0
Music           207294847.0
Fashion         149422710.0
Publishing      145090177.0
Food            131378697.0
Art             101547028.0
Comics           74643648.0
Theater          44713013.0
Photography      39501225.0
Crafts           17760300.0
Journalism       15301995.0
Dance            13906929.0
Name: pledged, dtype: float64

2. What is the breakdown of the state of projects? That is, how many have failed, succeeded etc.? Calculate the answer both as absolute numbers and percentages.

In [23]:
kickstarter["state"].value_counts()

state
failed        197719
successful    133956
canceled       38779
undefined       3562
live            2799
suspended       1846
Name: count, dtype: int64

In [24]:
kickstarter["state"].value_counts(normalize=True)

state
failed        0.522153
successful    0.353762
canceled      0.102411
undefined     0.009407
live          0.007392
suspended     0.004875
Name: proportion, dtype: float64
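
`normalize=True` returns proportions between 0 and 1. To show them as percentages, one option is to multiply by 100 and round. A sketch on an invented list of states:

```python
import pandas as pd

# Hypothetical project states
states = pd.Series(["failed", "failed", "successful", "canceled"])

# Proportions scaled to percentages, rounded to one decimal place
state_pct = states.value_counts(normalize=True).mul(100).round(1)

print(state_pct)
```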

3. Which category has the highest *average* pledged amount?

In [25]:
# can use either mean or median of course
kickstarter.groupby("category")["pledged"].median().sort_values(ascending=False)

category
Design          1989.5
Dance           1823.0
Theater         1500.0
Comics          1480.0
Games           1283.0
Music           1020.0
Film & Video     730.0
Art              424.0
Technology       331.0
Publishing       276.0
Food             255.0
Fashion          241.0
Photography      235.0
Crafts            95.0
Journalism        51.0
Name: pledged, dtype: float64
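
Since the mean and the median can rank categories differently when a few projects raise very large amounts, it can help to compute both side by side with `.agg`. The numbers below are invented to show the effect of one outlier.

```python
import pandas as pd

# Hypothetical sample: one very large Design project skews the mean
sample = pd.DataFrame({
    "category": ["Design", "Design", "Design", "Music", "Music"],
    "pledged": [100.0, 200.0, 9000.0, 500.0, 700.0],
})

# One row per category, one column per statistic
summary = sample.groupby("category")["pledged"].agg(["mean", "median"])

print(summary)
```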

4. Find the most expensive (i.e. highest goal) project in the Photography category.

In [26]:
kickstarter[kickstarter["category"] == "Photography"].sort_values(by="goal", ascending=False).head()

Unnamed: 0,ID,name,subcategory,category,currency,deadline,goal,launched,pledged,state,backers,country,pct_goal_achieved
5076,1025947904,Long island city new york art book (Canceled),Photography,Photography,USD,2013-11-07,10000000.0,2013-10-08 23:16:26,0.0,canceled,0,US,0.0
135129,1686065004,Nature's Beauty and Wonders,Nature,Photography,USD,2015-07-05,7300000.0,2015-05-21 12:49:15,0.0,failed,0,US,0.0
162771,1828026144,"""SKY BLUE SKY"" - The Most Expen$ive Photograph...",Fine Art,Photography,USD,2016-03-07,6500001.0,2016-02-04 20:03:04,3.0,failed,3,US,4.615384e-07
7272,1036946517,Save Nature,Nature,Photography,USD,2016-07-12,2000000.0,2016-05-13 21:59:33,20.0,failed,1,US,1e-05
26680,1135431596,Challenge of the 14,Nature,Photography,DKK,2015-01-07,2000000.0,2014-11-28 21:41:41,2320.0,failed,6,DK,0.00116


5. Find the project in the Food category with the highest number of backers.

In [27]:
kickstarter[kickstarter["category"] == "Food"].sort_values(by="backers", ascending=False).head()

Unnamed: 0,ID,name,subcategory,category,currency,deadline,goal,launched,pledged,state,backers,country,pct_goal_achieved
246496,323562295,Misen: Cook Sharp,Food,Food,USD,2015-10-22,25000.0,2015-09-22 15:02:12,1083344.11,successful,13116,US,43.333764
184979,1941968078,Prepd Pack - The Lunchbox Reimagined,Food,Food,USD,2016-02-26,25000.0,2016-01-19 17:56:39,1439098.99,successful,12557,US,57.56396
139121,170672682,"The Field Skillet: Lighter, Smoother Cast Iron",Food,Food,USD,2016-04-07,30000.0,2016-03-07 02:25:16,1633361.53,successful,12553,US,54.445384
16654,108427247,Anova Precision Cooker - Cook sous vide with y...,Food,Food,USD,2014-06-17,100000.0,2014-05-06 09:01:51,1811321.6,successful,10508,US,18.113216
225764,217543389,The uKeg Pressurized Growler for Fresh Beer,Drinks,Food,USD,2014-12-08,75000.0,2014-10-15 06:34:48,1559525.68,successful,10293,US,20.793676
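
Sorting and taking `.head()` works, but when only the top row (or top few rows) is needed, `.idxmax` and `.nlargest` say so more directly. A sketch on an invented frame:

```python
import pandas as pd

# Hypothetical sample of projects
sample = pd.DataFrame({
    "name": ["Knife", "Lunchbox", "Skillet", "Album"],
    "category": ["Food", "Food", "Food", "Music"],
    "backers": [13116, 12557, 12553, 20000],
})

food = sample[sample["category"] == "Food"]

# idxmax returns the index label of the largest value; .loc fetches that row
top_food = food.loc[food["backers"].idxmax()]

# nlargest returns the n rows with the largest values, largest first
top_two = food.nlargest(2, "backers")

print(top_food["name"])
print(top_two)
```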


6. **BONUS** Find the project with the longest name.

_Hint: figure out how to calculate the length of the names first!_

In [28]:
kickstarter["name_length"] = kickstarter["name"].str.len()
kickstarter.sort_values(by="name_length", ascending=False).head()

Unnamed: 0,ID,name,subcategory,category,currency,deadline,goal,launched,pledged,state,backers,country,pct_goal_achieved,name_length
256578,374803524,BENEATH - Publish the book â¢ Visit another w...,Fiction,Publishing,USD,2010-09-24,5500.0,2010-08-10 21:54:47,1000.0,canceled,26,US,0.181818,99
241231,296883451,The Release of Sharp Tungâs exclusive Record...,Hip-Hop,Music,USD,2010-10-23,5000.0,2010-08-24 15:26:00,135.0,canceled,3,US,0.027,98
159093,1809078773,"One More Time â CD with 10 original songs, i...",Country & Folk,Music,USD,2010-09-20,4000.0,2010-08-16 01:26:24,30.0,canceled,2,US,0.0075,97
137838,1700039760,ACID MARSHMALLOW: Improve my Audio & Visual do...,Documentary,Film & Video,USD,2009-12-29,1600.0,2009-11-26 10:12:11,0.0,canceled,0,US,0.0,96
178536,1908331441,Always Something Doing: An Archive of Copyrigh...,Webseries,Film & Video,USD,2010-11-16,13000.0,2010-10-05 02:22:24,210.0,canceled,5,US,0.016154,96


_If you want to print the actual value without it truncating, you can access `.values` to get the underlying `numpy` array data_

In [29]:
kickstarter.sort_values(by="name_length", ascending=False)["name"].values[0]

'BENEATH - Publish the book â\x80¢ Visit another world â\x80¢ Lose your mind â\x80¢ Cringe in fear (Canceled)'
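
The `â\x80¢` sequences in this name look like UTF-8 text that was decoded as Latin-1 at some point, which would also inflate the character counts used above. Assuming that is what happened, the sketch below reverses it. `fix_mojibake` is a hypothetical helper, not a `pandas` function, and it leaves a string unchanged when the round trip is not possible.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8-read-as-Latin-1 garbling; return the input if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8, so it was probably not garbled this way.
        return text


# The three characters \u00e2 \u0080 \u00a2 are how a bullet appears when garbled
garbled = "BENEATH - Publish the book \u00e2\u0080\u00a2 Visit another world"
repaired = fix_mojibake(garbled)

print(repaired)
print(len(garbled), len(repaired))
```

The helper could be applied to a whole column with `.map(fix_mojibake)` on the non-missing values before measuring name lengths.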

<h1 style="color: #fcd805">Exercise: Combining Data</h1>

1. Select all the rows from the `earning` table in the movies database into a `pandas` `DataFrame`.

In [30]:
import sqlite3

# connect to the DB, then rebuild the merged film + genre data that question 2 joins onto
conn = sqlite3.connect("./data/movies.sqlite")

# extract films
films = pd.read_sql("""
SELECT
    *
FROM
    IMDB
""", conn)

# extract genres
genres = pd.read_sql("""
SELECT
    *
FROM
    genre
""", conn)

# merge
films_merged = films.merge(genres, on="Movie_id", how="inner")

# dedupe: keep one row per film (the first genre listed for each Movie_id)
films_deduped = films_merged.drop_duplicates(subset=["Movie_id"], keep="first")
films_deduped.head()

Unnamed: 0,Movie_id,Title,Rating,TotalVotes,MetaCritic,Budget,Runtime,CVotes10,CVotes09,CVotes08,...,Votes3044M,Votes3044F,Votes45A,Votes45AM,Votes45AF,VotesIMDB,Votes1000,VotesUS,VotesnUS,genre
0,36809,12 Years a Slave (2013),8.1,496092,96.0,20000000.0,134 min,75556,126223,161460,...,7.9,8.0,7.8,7.8,8.1,8.0,7.7,8.3,8.0,Biography
3,30114,127 Hours (2010),7.6,297075,82.0,18000000.0,94 min,28939,44110,98845,...,7.5,7.5,7.3,7.3,7.5,7.6,7.0,7.7,7.6,Adventure
6,37367,50/50 (2011),7.7,283935,72.0,8000000.0,100 min,28304,47501,99524,...,7.6,7.6,7.4,7.4,7.5,7.4,7.0,7.9,7.6,Comedy
9,49473,About Time (2013),7.8,225412,,12000000.0,123 min,38556,43170,70850,...,7.6,7.7,7.6,7.5,7.8,7.7,6.9,7.8,7.7,Comedy
12,14867,Amour (2012),7.9,76121,94.0,8900000.0,127 min,11093,15944,22942,...,7.7,7.9,7.9,7.8,8.1,6.6,7.2,7.9,7.8,Drama


In [31]:
earnings = pd.read_sql("""
SELECT
    *
FROM
    earning
""", conn)

earnings.head()

Unnamed: 0,Movie_id,Domestic,Worldwide
0,36809,56671993,187733202.0
1,30114,18335230,60738797.0
2,37367,35014192,39187783.0
3,49473,15322921,87100449.0
4,14867,6739492,19839492.0


2. Now join the earnings data onto the merged film+genre data.

You should now have a `DataFrame` with one row per film and with genre and earnings data added on at the end.

Verify that this is the case before moving on.

In [32]:
films_all = films_deduped.merge(earnings, on="Movie_id")

print(films_all.shape)

films_all.head()

(117, 55)


Unnamed: 0,Movie_id,Title,Rating,TotalVotes,MetaCritic,Budget,Runtime,CVotes10,CVotes09,CVotes08,...,Votes45A,Votes45AM,Votes45AF,VotesIMDB,Votes1000,VotesUS,VotesnUS,genre,Domestic,Worldwide
0,36809,12 Years a Slave (2013),8.1,496092,96.0,20000000.0,134 min,75556,126223,161460,...,7.8,7.8,8.1,8.0,7.7,8.3,8.0,Biography,56671993,187733202.0
1,30114,127 Hours (2010),7.6,297075,82.0,18000000.0,94 min,28939,44110,98845,...,7.3,7.3,7.5,7.6,7.0,7.7,7.6,Adventure,18335230,60738797.0
2,37367,50/50 (2011),7.7,283935,72.0,8000000.0,100 min,28304,47501,99524,...,7.4,7.4,7.5,7.4,7.0,7.9,7.6,Comedy,35014192,39187783.0
3,49473,About Time (2013),7.8,225412,,12000000.0,123 min,38556,43170,70850,...,7.6,7.5,7.8,7.7,6.9,7.8,7.7,Comedy,15322921,87100449.0
4,14867,Amour (2012),7.9,76121,94.0,8900000.0,127 min,11093,15944,22942,...,7.9,7.8,8.1,6.6,7.2,7.9,7.8,Drama,6739492,19839492.0
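
Besides checking the shape, the "one row per film" requirement can be enforced by the merge itself with the `validate` argument, which raises an error if a key is duplicated. A sketch with two invented three-row tables:

```python
import pandas as pd

# Hypothetical stand-ins for the film and earnings tables
films = pd.DataFrame({"Movie_id": [1, 2, 3], "genre": ["Drama", "Comedy", "Action"]})
earnings = pd.DataFrame({"Movie_id": [1, 2, 3], "Domestic": [100, 200, 300]})

# validate="one_to_one" checks that Movie_id is unique on both sides
merged = films.merge(earnings, on="Movie_id", how="inner", validate="one_to_one")
one_row_per_film = merged["Movie_id"].is_unique

# A duplicated key makes the same merge fail loudly instead of adding rows
films_with_dupe = pd.concat([films, films.iloc[[0]]], ignore_index=True)
try:
    films_with_dupe.merge(earnings, on="Movie_id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, one_row_per_film, duplicate_detected)
```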


3. Which film earned the least **domestically**?

In [33]:
films_all.sort_values("Domestic").head(1)

Unnamed: 0,Movie_id,Title,Rating,TotalVotes,MetaCritic,Budget,Runtime,CVotes10,CVotes09,CVotes08,...,Votes45A,Votes45AM,Votes45AF,VotesIMDB,Votes1000,VotesUS,VotesnUS,genre,Domestic,Worldwide
109,20709,Tyrannosaur (2011),7.6,26016,65,1000000.0,,2060,4083,9078,...,7.5,7.4,5.8,6.5,7.4,7.6,,Drama,22321,22321.0


4. Which film earned the most **worldwide**?

In [34]:
films_all.sort_values("Worldwide", ascending=False).head(1)

Unnamed: 0,Movie_id,Title,Rating,TotalVotes,MetaCritic,Budget,Runtime,CVotes10,CVotes09,CVotes08,...,Votes45A,Votes45AM,Votes45AF,VotesIMDB,Votes1000,VotesUS,VotesnUS,genre,Domestic,Worldwide
79,38626,Star Wars: The Force Awakens (2015),8.1,676732,81,245000000.0,136 min,155391,161810,166378,...,7.9,7.8,8.2,8.3,7.7,8.2,7.9,Action,936662225,2068224000.0


5. How many films have a MetaCritic score of less than 75?

_Note: to answer this question you'll have to fix the data type of the column first and you may need to deal with some non-numeric values!_

In [35]:
# either we assume empty strings are 0
films_all["MetaCritic"].replace("", 0).astype(int)


0      96
1      82
2      72
3       0
4      94
       ..
112    88
113    72
114    74
115    65
116    78
Name: MetaCritic, Length: 117, dtype: int32

In [36]:
# or we could replace them with a NULL value explicitly
# NOTE: in pandas, the default integer type doesn't support NULL values
# so we could convert the column to a float
import numpy as np

films_all["MetaCritic"].replace("", np.nan).astype(float)


0      96.0
1      82.0
2      72.0
3       NaN
4      94.0
       ... 
112    88.0
113    72.0
114    74.0
115    65.0
116    78.0
Name: MetaCritic, Length: 117, dtype: float64

Let's use the NULL approach since we don't know if those missing scores actually mean the film scored 0.

In [37]:
films_all["MetaCritic"] = films_all["MetaCritic"].replace("", np.nan).astype(float)


In [38]:
under_75 = films_all[films_all["MetaCritic"] < 75]
len(under_75)

45
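
Another way to handle the non-numeric values is `pd.to_numeric` with `errors="coerce"`, which turns anything it cannot parse into `NaN` in one step. A sketch on an invented column of scores stored as text:

```python
import pandas as pd

# Hypothetical scores stored as text, with one empty string
scores = pd.Series(["96", "82", "", "72"])

# Unparseable entries become NaN instead of raising an error
numeric_scores = pd.to_numeric(scores, errors="coerce")

# Comparisons with NaN are False, so missing scores are not counted
n_under_75 = int((numeric_scores < 75).sum())

print(numeric_scores)
print(n_under_75)
```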

6. Which genre has the highest total domestic earnings?

In [39]:
films_all.groupby("genre")["Domestic"].sum().sort_values(ascending=False)

genre
Action       7142845464
Animation    2927485608
Adventure    1848562214
Drama        1463912296
Biography    1301951994
Crime         603980443
Comedy        498684743
Mystery       128012934
Name: Domestic, dtype: int64

7. Convert the `Runtime` column to numeric.

_Hint: You'll have to perform some string manipulation on it before you can do this._

In [40]:
films_all["runtime_mins"] = films_all["Runtime"].replace("", np.nan).str[:-4].astype(float)
films_all[["Runtime", "runtime_mins"]].head()

Unnamed: 0,Runtime,runtime_mins
0,134 min,134.0
1,94 min,94.0
2,100 min,100.0
3,123 min,123.0
4,127 min,127.0
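
Slicing off the last four characters relies on every value ending in exactly `" min"`. A slightly more defensive sketch pulls out the digits with `.str.extract` instead; the sample values are invented, and the pattern assumes the runtime is the first run of digits in the string.

```python
import pandas as pd

# Hypothetical runtimes, including one empty string
runtime = pd.Series(["134 min", "94 min", "", "100 min"])

# Capture the first run of digits; rows without digits become NaN
digits = runtime.str.extract(r"(\d+)", expand=False)
runtime_mins = pd.to_numeric(digits, errors="coerce")

print(runtime_mins)
```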


8. Now find the genre with the highest **median** runtime.

In [41]:
films_all.groupby("genre")["runtime_mins"].median().sort_values(ascending=False)

genre
Mystery      138.0
Adventure    135.0
Action       131.0
Biography    127.0
Drama        126.0
Crime        125.0
Comedy       119.0
Animation    103.0
Name: runtime_mins, dtype: float64

<h1 style="color: #fcd805">Exercise: Pub Names</h1>

Let's do some open-ended data analysis with `pandas`!

We're going to find out what the most common pub name is in the UK.

1. Read in the file `open_pubs.csv` from the `data` folder into a `pandas` `DataFrame` (data originally from https://www.getthedata.com/open-pubs).

In [42]:
pubs = pd.read_csv("./data/open_pubs.csv")
pubs.head()

Unnamed: 0,22,Anchor Inn,"Upper Street, Stratford St Mary, COLCHESTER",CO7 6LW,604749,234404,51.970379,0.979340,Babergh
0,36,Ark Bar Restaurant,"Ark Bar And Restaurant, Cattawade Street, Bran...",CO11 1RH,610194,233329,51.958698,1.057832,Babergh
1,74,Black Boy,"The Lady Elizabeth, 7 Market Hill, SUDBURY, Su...",CO10 2EA,587334,241316,52.038595,0.729915,Babergh
2,75,Black Horse,"Lower Street, Stratford St Mary, COLCHESTER",CO7 6JS,622675,-5527598,\N,\N,Babergh
3,76,Black Lion,"Lion Road, Glemsford, SUDBURY",CO10 7RF,622675,-5527598,\N,\N,Babergh
4,97,Brewers Arms,"The Brewers Arms, Bower House Tye, Polstead, C...",CO6 5BZ,598743,240655,52.028694,0.895650,Babergh


2. Looks like there are no column headers!

Here is the data dictionary for the dataset:

|Field|Data type|Comments|
|---|---|---|
|fsa_id|int|Food Standards Agency's ID for this pub.|
|name|string|Name of the pub.|
|address|string|Address fields separated by commas.|
|postcode|string|Postcode of the pub.|
|easting|int| |
|northing|int| |
|latitude|decimal| |
|longitude|decimal| |
|local_authority|string|Local authority this pub falls under.|

Read the documentation for the `read_csv` method and figure out how to add column names to the data when you read it in.

In [43]:
pubs = pd.read_csv("./data/open_pubs.csv",
                   header=None,
                   names=["id", "name", "address", "postcode",
                          "easting", "northing", "latitude",
                          "longitude", "local_authority"])
pubs.head()

Unnamed: 0,id,name,address,postcode,easting,northing,latitude,longitude,local_authority
0,22,Anchor Inn,"Upper Street, Stratford St Mary, COLCHESTER",CO7 6LW,604749,234404,51.970379,0.979340,Babergh
1,36,Ark Bar Restaurant,"Ark Bar And Restaurant, Cattawade Street, Bran...",CO11 1RH,610194,233329,51.958698,1.057832,Babergh
2,74,Black Boy,"The Lady Elizabeth, 7 Market Hill, SUDBURY, Su...",CO10 2EA,587334,241316,52.038595,0.729915,Babergh
3,75,Black Horse,"Lower Street, Stratford St Mary, COLCHESTER",CO7 6JS,622675,-5527598,\N,\N,Babergh
4,76,Black Lion,"Lion Road, Glemsford, SUDBURY",CO10 7RF,622675,-5527598,\N,\N,Babergh


3. Check for any missing data. Drop any row with no name, since we need values from that column.

In [44]:
pubs.isnull().sum()

id                 0
name               0
address            0
postcode           0
easting            0
northing           0
latitude           0
longitude          0
local_authority    0
dtype: int64
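
The zero counts are worth a second look: the `latitude` and `longitude` columns contain the text `\N` in some rows, which appears to be a placeholder for a missing value but is not recognised as one by default. Assuming that reading is right, `read_csv` can be told about it with `na_values`. The sketch uses two shortened, partly invented rows rather than the real file.

```python
import io

import pandas as pd

# Two shortened rows in the style of the pubs file; raw strings keep "\N" literal
rows = [
    r"22,Anchor Inn,CO7 6LW,51.970379,0.979340",
    r"75,Black Horse,CO7 6JS,\N,\N",
]
csv_text = "\n".join(rows) + "\n"

sample = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["id", "name", "postcode", "latitude", "longitude"],
    na_values=[r"\N"],  # treat the placeholder as missing
)

print(sample.isnull().sum())
print(sample.dtypes)
```

With the placeholder declared, the coordinate columns load as floats and the missing-value check reports them.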

4. Convert the `name` column (or whatever you called it) to the correct `string` type.

In [45]:
pubs.dtypes

id                  int64
name               object
address            object
postcode           object
easting             int64
northing            int64
latitude           object
longitude          object
local_authority    object
dtype: object

In [46]:
pubs["name"] = pubs["name"].astype("string")

5. Now convert the values in the `name` column to lowercase so that names like "The King's Arms" and "The king's arms" are treated as the same name.

In [47]:
pubs["name"] = pubs["name"].str.lower()

6. Use the `.str.strip()` method to remove any leading or trailing whitespace from the `name` column.

In [48]:
pubs["name"] = pubs["name"].str.strip()

7. Now use `.str.replace` to remove the word "the" from the pub names, so that a pub called "The King's Head" will be treated as having the same name as one that's simply called "King's Head".

*Tip: take care not to replace words that **contain** the word `the` like "theatre"*

In [49]:
pubs["name"] = pubs["name"].str.replace("the ", "")
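
Replacing the plain text `"the "` leaves "theatre" alone, but it still matches inside words that end in "the", such as "bathe". A stricter variant, sketched below on invented names, uses a regular expression with a word boundary. Applying it to the real data may change the counts shown in the following outputs.

```python
import pandas as pd

# Hypothetical pub names chosen to show the edge cases
names = pd.Series(
    ["the king's head", "theatre bar", "bathe inn", "hole in the wall"],
    dtype="string",
)

# Plain substring replacement: also cuts into "bathe inn"
plain = names.str.replace("the ", "", regex=False)

# \b requires "the" to start at a word boundary; \s+ consumes the following space(s)
word_only = names.str.replace(r"\bthe\s+", "", regex=True)

print(plain.tolist())
print(word_only.tolist())
```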

8. Use your `name` column to find the most common pub name in the UK.

In [50]:
pubs["name"].value_counts().head(10)

name
red lion                334
royal oak               286
crown inn               198
new inn                 181
white hart              165
kings arms              157
royal british legion    148
ship inn                140
kings head              130
plough inn              129
Name: count, dtype: Int64

BONUS: which local authority has the most of these pubs (i.e. the most pubs that have the most common name you found in question 8)?

In [51]:
pubs.loc[pubs["name"] == "red lion", "local_authority"].value_counts().head()

local_authority
West Northamptonshire    10
Buckinghamshire          10
East Lindsey              8
Cherwell                  7
Wiltshire                 7
Name: count, dtype: int64

BONUS: how many unique pub names are there in the data? That is, pub names that appear exactly once.

In [52]:
pub_counts = pubs["name"].value_counts()
pub_counts

name
red lion                                    334
royal oak                                   286
crown inn                                   198
new inn                                     181
white hart                                  165
                                           ... 
main stage                                    1
marys                                         1
mynyddygarreg r f c                           1
nantgaredig athletic rugby football club      1
y tai                                         1
Name: count, Length: 34111, dtype: Int64

This is a `pandas` `Series`. We can either filter it directly:

In [53]:
pub_counts[pub_counts == 1]

name
colville house                              1
carlton road sports and social club         1
crown inn snape                             1
felixstowe bowls club                       1
deben bar private members club              1
                                           ..
main stage                                  1
marys                                       1
mynyddygarreg r f c                         1
nantgaredig athletic rugby football club    1
y tai                                       1
Name: count, Length: 30728, dtype: Int64

Or we could explicitly convert it to a `DataFrame` first.

The `.reset_index` method turns the index of the `Series` (the pub names, which look like a first column above but are technically the index) into a regular column. The result is a `DataFrame`:

In [54]:
pub_counts_df = pub_counts.reset_index()
pub_counts_df.head()

Unnamed: 0,name,count
0,red lion,334
1,royal oak,286
2,crown inn,198
3,new inn,181
4,white hart,165


In [55]:
pub_counts_df[pub_counts_df["count"] == 1]

Unnamed: 0,name,count
3383,colville house,1
3384,carlton road sports and social club,1
3385,crown inn snape,1
3386,felixstowe bowls club,1
3387,deben bar private members club,1
...,...,...
34106,main stage,1
34107,marys,1
34108,mynyddygarreg r f c,1
34109,nantgaredig athletic rugby football club,1
