## More In-Depth Data Munging
Let's learn to deal with:

- String manipulation / clean-up
- Detecting and dealing with missing data

Est. duration: 60 minutes

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/suneel0101/lesson-plan/master/crunchbase_monthly_export.csv",
    encoding='ISO-8859-1'
)

In [1]:
# Let's fix columns

In [4]:
df.head()

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,city,funding_rounds,founded_at,founded_month,founded_quarter,founded_year,first_funding_at,last_funding_at,Unnamed: 18
0,/organization/canal-do-credito,Canal do Credito,http://www.canaldocredito.com.br,|Credit|Technology|Services|Finance|,Credit,750000,,BRA,,Rio de Janeiro,Belo Horizonte,1,,,,,1/1/10,1/1/10,
1,/organization/waywire,#waywire,http://www.waywire.com,|Entertainment|Politics|Social Media|News|,Entertainment,1750000,acquired,USA,NY,New York City,New York,1,6/1/12,2012-06,2012-Q2,2012.0,6/30/12,6/30/12,
2,/organization/tv-communications,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,Los Angeles,2,,,,,6/4/10,9/23/10,
3,/organization/rock-your-paper,'Rock' Your Paper,http://www.rockyourpaper.org,|Publishing|Education|,Education,40000,operating,EST,,Tallinn,Tallinn,1,10/26/12,2012-10,2012-Q4,2012.0,8/9/12,8/9/12,
4,/organization/in-touch-network,(In)Touch Network,http://www.InTouchNetwork.com,|Electronics|Guides|Coffee|Restaurants|Music|i...,Apps,1500000,operating,GBR,,London,London,1,4/1/11,2011-04,2011-Q2,2011.0,4/1/11,4/1/11,


## Warm Up Exercise
Prepend `https://www.crunchbase.com` to `permalink` and save in new column called `full_url`
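
One possible approach to the warm-up, sketched on a tiny hand-made frame so it runs without downloading the CSV. The name `sample` is hypothetical; the two permalinks are copied from the `df.head()` output above.

```python
import pandas as pd

# Small stand-in for df; only the column needed for the warm-up.
sample = pd.DataFrame({
    "permalink": ["/organization/canal-do-credito", "/organization/waywire"]
})

# Adding a plain string to a string column prepends it to every row.
sample["full_url"] = "https://www.crunchbase.com" + sample["permalink"]
print(sample["full_url"].tolist())
```

The same line should work on the full `df`, assuming `permalink` holds strings throughout.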

## Let's look at some string methods
- upper and replace
- `df['category_list'].str.split('|')`
- `df['category_list'].str.split('|').str.get(2)`
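
A minimal sketch of the methods listed above, run on a small Series (`cats` is a hypothetical name) instead of the full column. Note that because each value starts and ends with `|`, splitting produces an empty string at both ends of the list.

```python
import pandas as pd

cats = pd.Series([
    "|Credit|Technology|Services|Finance|",
    "|Games|",
    None,  # missing values stay missing through the .str methods
])

upper = cats.str.upper()                          # upper-case every value
spaced = cats.str.replace("|", " ", regex=False)  # literal replace, not a regex
parts = cats.str.split("|")                       # each value becomes a list
third = parts.str.get(2)                          # element at position 2 of each list

print(parts[0])  # ['', 'Credit', 'Technology', 'Services', 'Finance', '']
print(third.tolist())
```

For `"|Games|"` the list is `['', 'Games', '']`, so position 2 is an empty string rather than a category.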

In [6]:
df['category_list'].head()

0                 |Credit|Technology|Services|Finance|
1           |Entertainment|Politics|Social Media|News|
2                                              |Games|
3                               |Publishing|Education|
4    |Electronics|Guides|Coffee|Restaurants|Music|i...
Name: category_list, dtype: object

In [13]:
# df['category_list'].str.split('|')

## Exercise
Create a new column `first_category` that is just the first category in the list.
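
One way this could be done, shown on a hypothetical three-row frame named `sample`. Stripping the outer pipes first is a choice made here so that position 0 holds a real category instead of an empty string.

```python
import pandas as pd

sample = pd.DataFrame({
    "category_list": ["|Credit|Technology|Services|Finance|", "|Games|", None]
})

# Remove the leading/trailing "|", split on the remaining pipes, take element 0.
sample["first_category"] = (
    sample["category_list"].str.strip("|").str.split("|").str.get(0)
)
print(sample["first_category"].tolist())
```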

## Bonus Exercise
Create 3 new columns, `cat_1`, `cat_2`, and `cat_3`, that hold the first 3 categories in the list.

Hint: 
- Use `expand=True` and `merge`
- [Resources](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
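
A sketch of how the hint might be applied, on a hypothetical frame `sample`. It assumes at least one row has three or more categories; otherwise the split yields fewer than three columns and the renaming step would need adjusting.

```python
import pandas as pd

sample = pd.DataFrame({"category_list": [
    "|Credit|Technology|Services|Finance|",
    "|Publishing|Education|",
    "|Games|",
]})

# expand=True returns a DataFrame with one column per split piece.
pieces = sample["category_list"].str.strip("|").str.split("|", expand=True)

# Keep the first three pieces and give them readable names.
first_three = pieces.iloc[:, :3].copy()
first_three.columns = ["cat_1", "cat_2", "cat_3"]

# Both frames share the same index, so merge on it.
merged = sample.merge(first_three, left_index=True, right_index=True)
print(merged)
```

Rows with fewer than three categories end up with missing values in the trailing columns.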

## Dealing with missing data / Data Cleanup
`url = 'https://api.opencagedata.com/geocode/v1/json?q={}&key=dcf4dcba9494433eb351f4b8c3f9d839&pretty=1'`

## Discussion
What are different ways of dealing with missing data?
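
To seed the discussion, here is a small sketch of three common options in pandas: detecting, dropping, and filling. The frame `sample` is hypothetical; its values mirror three rows of the `df.head()` output, and the fill value `"unknown"` is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "city": ["Belo Horizonte", "New York", "Tallinn"],
    "state_code": [np.nan, "NY", np.nan],
    "founded_year": [np.nan, 2012.0, 2012.0],
})

# Detect: count missing values per column.
missing_counts = sample.isna().sum()

# Drop: remove rows where a column we care about is missing.
dropped = sample.dropna(subset=["founded_year"])

# Fill: replace missing values with a chosen default.
filled = sample.fillna({"state_code": "unknown"})

print(missing_counts)
```

Which option fits depends on the analysis; a fourth option, used later in this lesson, is to look the missing value up from another source.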

## Exercise 
Get all rows where the `country_code` is missing


## Exercise
Get a unique list of cities for the rows found in the previous exercise
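
A possible shape for these two exercises, using a hypothetical frame `sample` with made-up rows (the rows shown in `df.head()` all have a country code, so they would not illustrate the filter).

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "city": ["Boston", "Beijing", "Boston", "London"],
    "country_code": [None, None, None, "GBR"],
})

# Boolean mask: True where country_code is missing.
missing = sample[sample["country_code"].isna()]

# Unique city names among those rows, skipping rows where the city is also missing.
cities = missing["city"].dropna().unique().tolist()
print(cities)
```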

## Exercise
Use the `requests` library to make a request to the following `url` and get back the JSON.

```python
>>> import requests
>>> res = requests.get(...)
>>>...
```

In [16]:
url = "https://api.opencagedata.com/geocode/v1/json?q=Boston&key=dcf4dcba9494433eb351f4b8c3f9d839&pretty=1"

## Exercise
Create a function `get_country_code` that takes a city name and puts it in the URL, e.g.

`https://api.opencagedata.com/geocode/v1/json?q=THECITYNAME&key=dcf4dcba9494433eb351f4b8c3f9d839&pretty=1`

The function should return the country code. If it can't find one, or there is some error, it should return `None`.

For example,

```python
>>> get_country_code("Beijing")
'CN'
>>> print(get_country_code(23452252252))
None
```

Use this to fill in missing data for country code.
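
A minimal sketch of one way to write such a function. It assumes the JSON response stores the code at `results[0]["components"]["country_code"]` in lower case; check this against an actual response before relying on it. The `fetch` parameter and the helper `fetch_json` are hypothetical additions so the logic can be tried with a fake response instead of a live request.

```python
from urllib.parse import quote

import requests

# URL template from the lesson; {} is replaced with the city name.
URL_TEMPLATE = (
    "https://api.opencagedata.com/geocode/v1/json"
    "?q={}&key=dcf4dcba9494433eb351f4b8c3f9d839&pretty=1"
)


def fetch_json(url):
    """Default fetcher (hypothetical helper): GET the URL and decode the JSON."""
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    return res.json()


def get_country_code(city, fetch=fetch_json):
    """Return an upper-case country code for `city`, or None on any failure."""
    try:
        # quote() escapes spaces and other special characters in the city name.
        payload = fetch(URL_TEMPLATE.format(quote(str(city))))
        # Assumed response shape; an empty "results" list raises IndexError.
        return payload["results"][0]["components"]["country_code"].upper()
    except (requests.RequestException, KeyError, IndexError,
            TypeError, ValueError, AttributeError):
        return None


# Try the parsing logic offline with a fake response.
fake = lambda url: {"results": [{"components": {"country_code": "cn"}}]}
print(get_country_code("Beijing", fetch=fake))
```

To fill the gaps, something like `df.loc[mask, "country_code"] = df.loc[mask, "city"].map(get_country_code)` could work, where `mask = df["country_code"].isna()`. Note that the existing values in the dataset look like three-letter codes (`USA`, `GBR`), so a two-letter result such as `'CN'` may need converting before it is stored.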