# Lab Assignment 9: Data Management Using `pandas`, Part 2
## DS 6001: Practice and Application of Data Science

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

## Problem 0
Import the following libraries:

In [1]:
import numpy as np
import pandas as pd

## Problem 1
In the first part of this lab, the goal is to merge data from the United Nations World Health Organization (https://www.who.int/who-un/en/) with data from the Varieties of Democracy Project (https://www.v-dem.net/en/). The UN-WHO studies health outcomes in a cross-national context, and V-Dem studies the quality of democracy as it changes across countries and over time. Merging the two datasets lets us study whether democratic quality can predict health outcomes.

The UN data contains cross-national time series data from the United Nations and World Health Organization, and includes three features:

* The number of physicians per 1000 people
* The percent of the population that is malnourished
* Health expenditure per capita

The VDem data comes from the Varieties of Democracy project, which aims to measure the quality of democracy and the amount of corruption in different countries over time (https://www.v-dem.net/en/data/data-version-8/). This data file contains indices of a country’s democratic quality, level of civil liberties, and corruption. It also contains a binary indicator that separates countries into democratic and nondemocratic states, and it includes a categorization of the corruption scale.

The URLs for the two datasets are:

In [2]:
undata_url = "https://github.com/jkropko/DS-6001/raw/master/localdata/UNdata.csv"
VDem_url = "https://github.com/jkropko/DS-6001/raw/master/localdata/vdem.csv"

### Part a
Load both CSV files. Make sure to check whether there are rows that should not be included in the dataframe, and whether there are missing codes that should be replaced with `NaN`. Fix these problems at the data loading stage, if you can. (Don't worry about column names or category labels yet.) Also, the UN data covers the years 1960-2014, and the VDem data covers the years 1960-2015. To make the timeframe match up, delete rows in the VDem data from 2015. (1 point)
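As a minimal sketch of fixing problems at the loading stage, the toy file below (made up for illustration, not the real UN file) uses `..` as a missing code and ends with two note lines that are not data. The names `raw_text` and `toy` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical miniature file: '..' marks a missing value, and the last two
# lines are source notes rather than data rows.
raw_text = """Country Name,1960 [YR1960],1961 [YR1961]
Afghanistan,0.034844,..
Albania,0.276291,..
Data from database: toy example
Last Updated: never"""

toy = pd.read_csv(
    io.StringIO(raw_text),
    na_values=['..'],   # read the '..' missing code as NaN
    skipfooter=2,       # drop the two note lines at the bottom of the file
    engine='python',    # skipfooter is only supported by the python engine
)
print(toy)
```

Inspecting the tail of the raw file first is what tells you how many footer lines to skip; the count of 2 here applies only to the toy text.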

In [3]:
UN_data = pd.read_csv(undata_url, na_values = ['NA', '..'], skipfooter = 5, engine = 'python')  # skipfooter requires the python engine
UN_data

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,0.034844,,,,,0.063428,...,0.136000,0.146000,0.145000,0.175000,0.194000,0.234000,0.225000,0.266000,,
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,0.276291,,,,,0.481283,...,1.150000,1.146000,,1.144000,1.132000,1.113000,1.145000,1.145000,,
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,0.173148,,,,,0.116414,...,,1.207000,,,1.207000,,,,,
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,,,,,,,...,3.640000,3.716000,,3.912000,4.000000,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
769,Health expenditure per capita (current US$),SH.XPD.PCAP,West Bank and Gaza,WBG,,,,,,,...,,,,,,,,,,
770,Health expenditure per capita (current US$),SH.XPD.PCAP,World,WLD,,,,,,,...,748.814060,822.305555,895.873123,906.913629,948.697374,1019.763427,1026.154802,1041.749957,1060.987128,
771,Health expenditure per capita (current US$),SH.XPD.PCAP,"Yemen, Rep.",YEM,,,,,,,...,52.095753,58.088157,69.685148,65.913927,67.748743,64.651604,73.906693,78.522490,79.936966,
772,Health expenditure per capita (current US$),SH.XPD.PCAP,Zambia,ZMB,,,,,,,...,62.922780,48.175662,66.538544,53.659674,64.175104,70.525818,82.869198,87.833023,85.853074,


In [4]:
VDem_data = pd.read_csv(VDem_url)
VDem_data = VDem_data[VDem_data['year'] != 2015]
VDem_data.sort_values(by = 'year', ascending = False)

Unnamed: 0,X1,country_name,country_id,country_text_id,year,historical_date,codingstart,gapstart,gapend,codingend,...,v2xcs_ccsi_codehigh,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh
2820,2821,Nepal,58,NPL,2014,2014-01-01,1900,,,2015,...,0.911314,0.699965,0.498516,0.649471,0.347777,,,,0.364904,0.520943
2541,2542,Bhutan,53,BTN,2014,2014-01-01,1900,,,2015,...,0.685898,0.378446,0.506593,0.673231,0.338780,,,,0.665078,0.801244
7877,7878,Mauritius,180,MUS,2014,2014-01-01,1900,,,2014,...,0.965865,0.800604,0.472824,0.680814,0.272150,0.773818,0.836344,0.711293,0.845570,0.927275
3038,3039,Zimbabwe,62,ZWE,2014,2014-01-01,1900,,,2015,...,0.624166,0.292913,0.443507,0.601263,0.294328,,,,0.508582,0.665112
2982,2983,Zambia,61,ZMB,2014,2014-01-01,1911,,,2015,...,0.870466,0.632406,0.378072,0.554108,0.224482,,,,0.659365,0.794777
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4810,4811,Spain,96,ESP,1960,1960-01-01,1900,,,2014,...,0.120080,0.014466,0.140586,0.267904,0.062243,0.233307,0.299597,0.167016,0.379845,0.551366
4865,4866,Syria,97,SYR,1960,1960-01-01,1918,1920.0,1922.0,2015,...,0.544189,0.234577,0.594737,0.753704,0.418123,,,,0.430169,0.599672
1183,1184,Honduras,27,HND,1960,1960-01-01,1900,,,2012,...,0.695802,0.351999,0.722834,0.848054,0.561368,0.346188,0.431875,0.260501,0.539862,0.708880
4921,4922,Tunisia,98,TUN,1960,1960-01-01,1900,,,2015,...,0.328505,0.096378,0.312544,0.500923,0.163638,0.459785,0.531622,0.387948,0.822688,0.910766


In [5]:
VDem_data.sort_values(by = 'country_name')

Unnamed: 0,X1,country_name,country_id,country_text_id,year,historical_date,codingstart,gapstart,gapend,codingend,...,v2xcs_ccsi_codehigh,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh
1584,1585,Afghanistan,36,AFG,1962,1962-01-01,1900,,,2015,...,0.435168,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402
1608,1609,Afghanistan,36,AFG,1986,1986-01-01,1900,,,2015,...,0.147095,0.023951,0.143448,0.254819,0.070712,0.209011,0.273436,0.144586,0.222300,0.369985
1607,1608,Afghanistan,36,AFG,1985,1985-01-01,1900,,,2015,...,0.147095,0.023951,0.143448,0.254819,0.070712,0.209011,0.273436,0.144586,0.222300,0.369985
1606,1607,Afghanistan,36,AFG,1984,1984-01-01,1900,,,2015,...,0.147095,0.023951,0.143448,0.254819,0.070712,0.209011,0.273436,0.144586,0.222300,0.369985
1605,1606,Afghanistan,36,AFG,1983,1983-01-01,1900,,,2015,...,0.147095,0.023951,0.143448,0.254819,0.070712,0.209011,0.273436,0.144586,0.222300,0.369985
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3004,3005,Zimbabwe,62,ZWE,1980,1980-01-01,1900,,,2015,...,0.677566,0.377337,0.224675,0.370942,0.118277,0.421430,0.504905,0.337956,0.337485,0.501730
3005,3006,Zimbabwe,62,ZWE,1981,1981-01-01,1900,,,2015,...,0.682531,0.385353,0.244451,0.392316,0.133322,0.480997,0.567518,0.394476,0.399338,0.568122
3006,3007,Zimbabwe,62,ZWE,1982,1982-01-01,1900,,,2015,...,0.682531,0.385353,0.244451,0.392316,0.133322,0.480997,0.567518,0.394476,0.399338,0.568122
2995,2996,Zimbabwe,62,ZWE,1971,1971-01-01,1900,,,2015,...,0.365235,0.100077,0.287414,0.441301,0.164969,0.091116,0.132708,0.049525,0.047051,0.119639


### Part b
The UN data contain certain rows that refer to groups of countries instead of to individual countries. Here’s a list of these non-countries:

In [6]:
noncountries = ['Arab World',  'Caribbean small states',  'Central Europe and the Baltics', 
    'Early-demographic dividend',  'East Asia & Pacific', 'East Asia & Pacific (excluding high income)', 
    'East Asia & Pacific (IDA & IBRD countries)', 'Euro area', 'Europe & Central Asia', 
    'Europe & Central Asia (excluding high income)', 'Europe & Central Asia (IDA & IBRD countries)', 'European Union', 
    'Fragile and conflict affected situations', 'Heavily indebted poor countries (HIPC)', 
    'High income', 'Late-demographic dividend', 'Latin America & Caribbean', 
    'Latin America & Caribbean (excluding high income)', 
    'Latin America & the Caribbean (IDA & IBRD countries)', 'Least developed countries: UN classification', 
    'Low & middle income', 'Low income', 'Lower middle income', 
    'Middle East & North Africa', 'Middle East & North Africa (excluding high income)',
    'Middle East & North Africa (IDA & IBRD countries)', 
    'Middle income', 'North America', 'OECD members', 
    'Other small states', 'Pacific island small states', 'Post-demographic dividend', 
    'Pre-demographic dividend', 'Small states', 'South Asia', 
    'South Asia (IDA & IBRD)', 'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)', 
    'Sub-Saharan Africa (IDA & IBRD countries)', 'Upper middle income', 'World']

We can use `.query()` to remove the non-countries from the data, but in this case there are complications due to the space in the name of the column `Country Name` and the use of an external list. So here let's use an alternative method:

First, apply the `.isin(noncountries)` method to the `Country Name` column of the UN data to create a series of values that are `True` if the `Country Name` on a row is one of the non-countries, and `False` otherwise. Second, use the `~` operator to negate the logical values: turn `True` to `False` and vice versa. Finally, pass this logical series to the `.loc[]` attribute of the dataframe to drop the rows that refer to these noncountries from the UN data. (1 point)

(If you wanted to use `.query()`, you would first need to rename `Country Name` to remove the space, and then you could use an `@` in front of `noncountries` to refer to the external list. But for this problem follow the instructions listed above.)
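The three steps described above can be sketched on a small made-up dataframe. The names `toy`, `toy_noncountries`, `is_noncountry`, and `countries_only` are hypothetical and stand in for the real UN data and list.

```python
import pandas as pd

# Toy stand-in for the UN data: two real countries and two aggregate rows
toy = pd.DataFrame({
    'Country Name': ['Albania', 'Arab World', 'Zambia', 'World'],
    'value': [1.1, 2.2, 3.3, 4.4],
})
toy_noncountries = ['Arab World', 'World']

# Step 1: True where the row refers to a non-country
is_noncountry = toy['Country Name'].isin(toy_noncountries)

# Steps 2 and 3: negate the mask with ~ and keep only the rows where it is True
countries_only = toy.loc[~is_noncountry]
print(countries_only)
```

Recent pandas versions also accept backtick-quoted column names inside `.query()`, which would avoid the rename, but the assignment asks for the `.isin()` approach.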

In [7]:
UN_Data = UN_data.rename(columns = {'Country Name': 'Country_Name'})
UN_Data.query('Country_Name not in @noncountries')

Unnamed: 0,Series Name,Series Code,Country_Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,0.034844,,,,,0.063428,...,0.136000,0.146000,0.145000,0.175000,0.194000,0.234000,0.225000,0.266000,,
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,0.276291,,,,,0.481283,...,1.150000,1.146000,,1.144000,1.132000,1.113000,1.145000,1.145000,,
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,0.173148,,,,,0.116414,...,,1.207000,,,1.207000,,,,,
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,,,,,,,...,3.640000,3.716000,,3.912000,4.000000,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
768,Health expenditure per capita (current US$),SH.XPD.PCAP,Virgin Islands (U.S.),VIR,,,,,,,...,,,,,,,,,,
769,Health expenditure per capita (current US$),SH.XPD.PCAP,West Bank and Gaza,WBG,,,,,,,...,,,,,,,,,,
771,Health expenditure per capita (current US$),SH.XPD.PCAP,"Yemen, Rep.",YEM,,,,,,,...,52.095753,58.088157,69.685148,65.913927,67.748743,64.651604,73.906693,78.522490,79.936966,
772,Health expenditure per capita (current US$),SH.XPD.PCAP,Zambia,ZMB,,,,,,,...,62.922780,48.175662,66.538544,53.659674,64.175104,70.525818,82.869198,87.833023,85.853074,


In [8]:
country_name = UN_data['Country Name']
mask = ~country_name.isin(noncountries)
UN_data = UN_data.loc[mask]
UN_data

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,0.034844,,,,,0.063428,...,0.136000,0.146000,0.145000,0.175000,0.194000,0.234000,0.225000,0.266000,,
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,0.276291,,,,,0.481283,...,1.150000,1.146000,,1.144000,1.132000,1.113000,1.145000,1.145000,,
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,0.173148,,,,,0.116414,...,,1.207000,,,1.207000,,,,,
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,,,,,,,...,3.640000,3.716000,,3.912000,4.000000,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
768,Health expenditure per capita (current US$),SH.XPD.PCAP,Virgin Islands (U.S.),VIR,,,,,,,...,,,,,,,,,,
769,Health expenditure per capita (current US$),SH.XPD.PCAP,West Bank and Gaza,WBG,,,,,,,...,,,,,,,,,,
771,Health expenditure per capita (current US$),SH.XPD.PCAP,"Yemen, Rep.",YEM,,,,,,,...,52.095753,58.088157,69.685148,65.913927,67.748743,64.651604,73.906693,78.522490,79.936966,
772,Health expenditure per capita (current US$),SH.XPD.PCAP,Zambia,ZMB,,,,,,,...,62.922780,48.175662,66.538544,53.659674,64.175104,70.525818,82.869198,87.833023,85.853074,


### Part c
Reshape the UN data to move the years from the columns to the rows. (Once the years are in the rows, they will have values such as "1960 [YR1960]".) (2 points)
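A wide-to-long reshape of this kind can be sketched with `pd.melt` on a made-up two-country table. The frame `wide` and its values are invented for illustration; only the column-name pattern mirrors the UN data.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year
wide = pd.DataFrame({
    'Country Name': ['Albania', 'Zambia'],
    '1960 [YR1960]': [0.28, None],
    '1961 [YR1961]': [0.30, 0.05],
})

# Every column not listed in id_vars is stacked into (variable, value) pairs
long = pd.melt(wide, id_vars=['Country Name'], var_name='variable', value_name='value')
print(long)
```

Each country now appears once per year column, so the two-row table becomes four rows.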

In [9]:
columns_that_will_continue_to_exist_as_columns = ['Series Name', 'Series Code', 'Country Name', 'Country Code']
columns_that_will_be_stored_in_rows = [column_name for column_name in UN_data.columns if column_name not in columns_that_will_continue_to_exist_as_columns]
UN_data = pd.melt(UN_data, id_vars = columns_that_will_continue_to_exist_as_columns, value_vars = columns_that_will_be_stored_in_rows)
UN_data

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,variable,value
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,1960 [YR1960],0.034844
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,1960 [YR1960],0.276291
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,1960 [YR1960],0.173148
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,1960 [YR1960],
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,1960 [YR1960],
...,...,...,...,...,...,...
36451,Health expenditure per capita (current US$),SH.XPD.PCAP,Virgin Islands (U.S.),VIR,2015 [YR2015],
36452,Health expenditure per capita (current US$),SH.XPD.PCAP,West Bank and Gaza,WBG,2015 [YR2015],
36453,Health expenditure per capita (current US$),SH.XPD.PCAP,"Yemen, Rep.",YEM,2015 [YR2015],
36454,Health expenditure per capita (current US$),SH.XPD.PCAP,Zambia,ZMB,2015 [YR2015],


### Part d
Rename the `variable` column to `year`. Then use string methods to remove the ends such as "[YR1960]" from the values of the new `year` column and convert the column to an integer data type.

Also, for whatever reason, real world data often contains multiple variables that are just different representations of the same information. In this case, the `Series Name` and `Series Code` variables tell us exactly the same thing, and the `Country Name` and `Country Code` variables tell us exactly the same thing. Unless I have a very good reason to keep both, I generally prefer to drop variables that are redundant and coded in a less helpful way. So drop `Series Code` and `Country Code`. (2 points)
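As a small illustration of the string cleanup, the sketch below strips the bracketed suffix with a regular expression and converts the result to integers. The frame `toy` is made up; note that `regex=True` is passed explicitly, since the default for `str.replace` differs across pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({'year': ['1960 [YR1960]', '2014 [YR2014]']})

# Remove the trailing ' [YRdddd]' text, then cast the remaining digits to int
toy['year'] = toy['year'].str.replace(r' \[YR\d{4}\]', '', regex=True).astype(int)
print(toy['year'].tolist())
```

Because the year always occupies the first four characters, slicing with `.str[:4]` would be another way to get the same digits.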

In [10]:
UN_data = UN_data.rename(columns = {'variable': 'year'})
UN_data['year'] = UN_data['year'].str.replace(r' \[YR\d{4}\]', '', regex = True)  # regex = True: the pattern is a regular expression
UN_data = UN_data.astype({'year': int})
UN_data = UN_data.drop(columns = ['Series Code', 'Country Code'])
UN_data

Unnamed: 0,Series Name,Country Name,year,value
0,"Physicians (per 1,000 people)",Afghanistan,1960,0.034844
1,"Physicians (per 1,000 people)",Albania,1960,0.276291
2,"Physicians (per 1,000 people)",Algeria,1960,0.173148
3,"Physicians (per 1,000 people)",American Samoa,1960,
4,"Physicians (per 1,000 people)",Andorra,1960,
...,...,...,...,...
36451,Health expenditure per capita (current US$),Virgin Islands (U.S.),2015,
36452,Health expenditure per capita (current US$),West Bank and Gaza,2015,
36453,Health expenditure per capita (current US$),"Yemen, Rep.",2015,
36454,Health expenditure per capita (current US$),Zambia,2015,


### Part e
Reshape the data to move the values of `Series Name` to separate columns. Make sure all of the columns exist in the dataframe after reshaping and are not stored in a row index or multi-index. Then rename the columns so that all of the columns have concise and descriptive names. (2 points)
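One way to do a long-to-wide reshape and end up with ordinary columns is `pivot_table` followed by `reset_index`. The sketch below uses an invented four-row frame, and the short column names are only examples.

```python
import pandas as pd

# Toy long table: one row per country, year, and series
long = pd.DataFrame({
    'Country Name': ['Albania', 'Albania', 'Zambia', 'Zambia'],
    'year': [1960, 1960, 1960, 1960],
    'Series Name': ['Physicians (per 1,000 people)',
                    'Health expenditure per capita (current US$)'] * 2,
    'value': [0.28, 12.0, 0.05, 3.0],
})

# Spread the series into columns, then move the index back into columns
wide = long.pivot_table(index=['Country Name', 'year'],
                        columns='Series Name', values='value').reset_index()
wide.columns.name = None  # drop the leftover 'Series Name' label on the column axis

wide = wide.rename(columns={
    'Country Name': 'country',
    'Health expenditure per capita (current US$)': 'health_expenditure',
    'Physicians (per 1,000 people)': 'physicians',
})
print(wide)
```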

In [11]:
pivot_table = pd.pivot_table(UN_data, values = 'value', index = ['Country Name', 'year'], columns = ['Series Name'])
numpy_record_array = pivot_table.to_records()
UN_data = pd.DataFrame(numpy_record_array)
UN_data = UN_data.rename(
    columns = {
        'Country Name': 'country',
        'Health expenditure per capita (current US$)': 'health_expenditure_per_capita_in_dollars',
        'Physicians (per 1,000 people)': 'physicians_per_1000_people',
        'Prevalence of undernourishment (% of population)': 'percent_undernourished'
    }
)
UN_data

Unnamed: 0,country,year,health_expenditure_per_capita_in_dollars,physicians_per_1000_people,percent_undernourished
0,Afghanistan,1960,,0.034844,
1,Afghanistan,1965,,0.063428,
2,Afghanistan,1970,,0.064900,
3,Afghanistan,1981,,0.077000,
4,Afghanistan,1986,,0.183100,
...,...,...,...,...,...
6396,Zimbabwe,2011,48.469580,0.083000,33.5
6397,Zimbabwe,2012,57.253763,,33.2
6398,Zimbabwe,2013,62.309228,,33.5
6399,Zimbabwe,2014,57.710452,,34.0


### Part f
Next we are going to join the cleaned UN data with the VDem data. In a perfect world, both datasets would include a shared numeric country ID field that we can use to match countries in one dataset to countries in the other. Unfortunately the UN data identifies the countries only by name. Worse still, while there is a big overlap, the two datasets cover different sets of countries.

First decide whether this merge is a one-to-one, one-to-many, many-to-one, or many-to-many merge and describe your rationale in words.

Then perform a test merge that checks whether your expectation that the merge is one-to-one, one-to-many, many-to-one, or many-to-many is confirmed, and reports whether each row is matched, appears only in the UN data, or appears only in the VDem data. Use the `.unique()` or `.value_counts()` method to display the names of the countries that are not matched. (2 points)
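A test merge of this kind can be sketched with the `indicator` and `validate` arguments of `pd.merge`. The example assumes toy frames in which each country-year pair appears at most once, so a merge keyed on both country and year is one-to-one; `un_toy`, `vdem_toy`, and their values are hypothetical.

```python
import pandas as pd

un_toy = pd.DataFrame({
    'country': ['France', 'France', 'Korea, Rep.'],
    'year': [2011, 2012, 2012],
    'physicians': [3.1, 3.2, 2.1],
})
vdem_toy = pd.DataFrame({
    'country': ['France', 'France', 'Korea_South'],
    'year': [2012, 2013, 2012],
    'democracy': [0.9, 0.9, 0.8],
})

# validate raises a MergeError if the keys are not unique on both sides;
# indicator records whether each row came from the left, the right, or both
test_merge = pd.merge(un_toy, vdem_toy, on=['country', 'year'], how='outer',
                      indicator='matched', validate='one_to_one')

counts = test_merge['matched'].value_counts()
unmatched_names = test_merge.loc[test_merge['matched'] != 'both', 'country'].unique()
print(counts)
print(unmatched_names)
```

In the toy output, France shows up among the unmatched names because of a difference in years, while the two Korea rows fail to match only because of spelling.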

Keyed on the country name alone, this is a many-to-many merge: the merge key is unique in neither `UN_data` nor `VDem_data`, because each country name appears once per year in both data frames.

In [12]:
UN_and_VDem_data = pd.merge(UN_data, VDem_data, left_on = 'country', right_on = 'country_name', how = 'outer', indicator = 'matched', validate = 'many_to_many')
UN_and_VDem_data

Unnamed: 0,country,year_x,health_expenditure_per_capita_in_dollars,physicians_per_1000_people,percent_undernourished,X1,country_name,country_id,country_text_id,year_y,...,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh,matched
0,Afghanistan,1960.0,,0.034844,,1583.0,Afghanistan,36.0,AFG,1960.0,...,0.143550,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
1,Afghanistan,1960.0,,0.034844,,1584.0,Afghanistan,36.0,AFG,1961.0,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
2,Afghanistan,1960.0,,0.034844,,1585.0,Afghanistan,36.0,AFG,1962.0,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
3,Afghanistan,1960.0,,0.034844,,1586.0,Afghanistan,36.0,AFG,1963.0,...,0.140015,0.111794,0.212807,0.050778,0.181335,0.232855,0.129815,0.172381,0.301402,both
4,Afghanistan,1960.0,,0.034844,,1587.0,Afghanistan,36.0,AFG,1964.0,...,0.229685,0.177830,0.304231,0.090927,0.174778,0.232559,0.116997,0.167323,0.300779,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253451,,,,,,8339.0,Slovakia,201.0,SVK,2008.0,...,0.761255,0.775410,0.873994,0.643597,0.889654,0.934997,0.844312,0.908137,0.960461,right_only
253452,,,,,,8340.0,Slovakia,201.0,SVK,2009.0,...,0.761255,0.775410,0.873994,0.643597,0.885768,0.931418,0.840118,0.908137,0.960461,right_only
253453,,,,,,8341.0,Slovakia,201.0,SVK,2010.0,...,0.761255,0.775410,0.873994,0.643597,0.876521,0.922915,0.830126,0.908137,0.960461,right_only
253454,,,,,,8342.0,Slovakia,201.0,SVK,2011.0,...,0.761255,0.775410,0.873994,0.643597,0.879075,0.925262,0.832887,0.908137,0.960461,right_only


In [13]:
UN_and_VDem_data['matched'] = UN_and_VDem_data['matched'].replace({'left_only': 'UN_data_only', 'right_only': 'VDem_data_only'})
UN_and_VDem_data

Unnamed: 0,country,year_x,health_expenditure_per_capita_in_dollars,physicians_per_1000_people,percent_undernourished,X1,country_name,country_id,country_text_id,year_y,...,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh,matched
0,Afghanistan,1960.0,,0.034844,,1583.0,Afghanistan,36.0,AFG,1960.0,...,0.143550,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
1,Afghanistan,1960.0,,0.034844,,1584.0,Afghanistan,36.0,AFG,1961.0,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
2,Afghanistan,1960.0,,0.034844,,1585.0,Afghanistan,36.0,AFG,1962.0,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
3,Afghanistan,1960.0,,0.034844,,1586.0,Afghanistan,36.0,AFG,1963.0,...,0.140015,0.111794,0.212807,0.050778,0.181335,0.232855,0.129815,0.172381,0.301402,both
4,Afghanistan,1960.0,,0.034844,,1587.0,Afghanistan,36.0,AFG,1964.0,...,0.229685,0.177830,0.304231,0.090927,0.174778,0.232559,0.116997,0.167323,0.300779,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253451,,,,,,8339.0,Slovakia,201.0,SVK,2008.0,...,0.761255,0.775410,0.873994,0.643597,0.889654,0.934997,0.844312,0.908137,0.960461,VDem_data_only
253452,,,,,,8340.0,Slovakia,201.0,SVK,2009.0,...,0.761255,0.775410,0.873994,0.643597,0.885768,0.931418,0.840118,0.908137,0.960461,VDem_data_only
253453,,,,,,8341.0,Slovakia,201.0,SVK,2010.0,...,0.761255,0.775410,0.873994,0.643597,0.876521,0.922915,0.830126,0.908137,0.960461,VDem_data_only
253454,,,,,,8342.0,Slovakia,201.0,SVK,2011.0,...,0.761255,0.775410,0.873994,0.643597,0.879075,0.925262,0.832887,0.908137,0.960461,VDem_data_only


In [14]:
UN_and_VDem_data_with_different_country_and_country_name = UN_and_VDem_data[UN_and_VDem_data['matched'] != 'both']
data_frame_with_country_and_country_name = UN_and_VDem_data_with_different_country_and_country_name[['country', 'country_name']]
data_frame_with_country_and_country_name = data_frame_with_country_and_country_name.dropna(how = 'all')  # assign the result; dropna does not modify in place
series_with_country = data_frame_with_country_and_country_name['country'].fillna(data_frame_with_country_and_country_name['country_name'])
series_with_country.unique()

array(['American Samoa', 'Andorra', 'Antigua and Barbuda', 'Aruba',
       'Bahamas, The', 'Bahrain', 'Belize', 'Bermuda',
       'Brunei Darussalam', 'Cabo Verde', 'Cayman Islands',
       'Channel Islands', 'Congo, Dem. Rep.', 'Congo, Rep.',
       "Cote d'Ivoire", 'Dominica', 'Egypt, Arab Rep.',
       'Equatorial Guinea', 'French Polynesia', 'Gambia, The',
       'Greenland', 'Grenada', 'Guam', 'Hong Kong SAR, China',
       'Iran, Islamic Rep.', 'Kiribati', 'Korea, Dem. People’s Rep.',
       'Korea, Rep.', 'Kuwait', 'Kyrgyz Republic', 'Lao PDR',
       'Luxembourg', 'Macao SAR, China', 'Macedonia, FYR', 'Malta',
       'Marshall Islands', 'Micronesia, Fed. Sts.', 'Monaco', 'Myanmar',
       'Nauru', 'New Caledonia', 'Northern Mariana Islands', 'Oman',
       'Palau', 'Puerto Rico', 'Russian Federation', 'Samoa',
       'San Marino', 'Singapore', 'Slovak Republic',
       'St. Kitts and Nevis', 'St. Lucia',
       'St. Vincent and the Grenadines', 'Syrian Arab Republic',
       'T

In [15]:
series_with_country.value_counts().sort_index()

American Samoa            2
Andorra                  20
Antigua and Barbuda      25
Aruba                     1
Bahamas, The             26
                         ..
Vietnam_Republic of      16
Virgin Islands (U.S.)     2
West Bank and Gaza        2
Yemen                    55
Yemen, Rep.              31
Name: country, Length: 91, dtype: int64

### Part g
There are many unmatched rows in this merge. There are three reasons why rows failed to match:
* Differences in geographical coverage: for example, the VDem data includes Taiwan, but the UN data does not
* Differences in time coverage: for example, the UN data includes records for France every year from 1970 through 2014, and VDem includes rows for France from 1960 to 2012, leaving 12 rows for France without matching years
* Differences in spelling: for example, South Korea is called "Korea, Rep." in the UN data and "Korea_South" in the VDem data.

We can't do anything about differences in geographic or temporal coverage. But we can recode some country names to account for differences in spelling and to match more rows that should match. Here is a list of differently spelled countries:

* "Burma_Myanmar" in VDem is "Myanmar" in the UN data
* "Cape Verde" in VDem is "Cabo Verde" in the UN data
* "Congo_Democratic Republic of" in VDem is "Congo, Dem. Rep." in the UN data
* "Congo_Republic of the" in VDem is "Congo, Rep." in the UN data
* "East Timor" in VDem is "Timor-Leste" in the UN data
* "Egypt" in VDem is "Egypt, Arab Rep." in the UN data
* "Gambia" in VDem is "Gambia, The" in the UN data
* "Iran" in VDem is "Iran, Islamic Rep." in the UN data
* "Ivory Coast" in VDem is "Cote d'Ivoire" in the UN data
* "Korea_North" in VDem is "Korea, Dem. People’s Rep." in the UN data
* "Korea_South" in VDem is "Korea, Rep." in the UN data
* "Kyrgyzstan" in VDem is "Kyrgyz Republic" in the UN data
* "Laos" in VDem is "Lao PDR" in the UN data
* "Macedonia" in VDem is "Macedonia, FYR" in the UN data
* "Palestine_West_Bank" in VDem is "West Bank and Gaza" in the UN data (there is also "Palestine_Gaza" in VDem, but since the UN combines data for the West Bank and Gaza, let's just use "Palestine_West_Bank" for this assignment)
* "Russia" in VDem is "Russian Federation" in the UN data
* "Slovakia" in VDem is "Slovak Republic" in the UN data
* "Syria" in VDem is "Syrian Arab Republic" in the UN data
* "Venezuela" in VDem is "Venezuela, RB" in the UN data
* "Vietnam_Democratic Republic of" in VDem is "Vietnam" in the UN data
* "Yemen" in VDem is "Yemen, Rep." in the UN data

Recode the country names listed above in one of the two dataframes to match the names in the other dataframe. Then perform an inner join of the two dataframes. Some rows will be dropped because of differences in coverage, but no rows will be dropped because of differences in spelling. (2 points)
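The recode-then-join pattern can be sketched with a dictionary passed to `.replace()`. The frames and the two-entry dictionary `toy_recode` below are made up, and the sketch assumes one row per country-year pair so that the join can be keyed on both columns.

```python
import pandas as pd

un_toy = pd.DataFrame({
    'country': ['Korea, Rep.', 'Lao PDR'],
    'year': [2012, 2012],
    'physicians': [2.1, 0.2],
})
vdem_toy = pd.DataFrame({
    'country': ['Korea_South', 'Laos', 'Taiwan'],
    'year': [2012, 2012, 2012],
    'democracy': [0.8, 0.1, 0.8],
})

# Map VDem spellings to UN spellings; names missing from the dict are left alone
toy_recode = {'Korea_South': 'Korea, Rep.', 'Laos': 'Lao PDR'}
vdem_toy['country'] = vdem_toy['country'].replace(toy_recode)

# An inner join keeps only the country-year pairs present in both frames
merged = pd.merge(un_toy, vdem_toy, on=['country', 'year'],
                  how='inner', validate='one_to_one')
print(merged)
```

Taiwan drops out here because of coverage, not spelling, which is the kind of loss the inner join is expected to produce.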

In [16]:
VDem_country_name_to_UN_country = {
    'Burma_Myanmar': 'Myanmar',
    'Cape Verde': 'Cabo Verde',
    'Congo_Democratic Republic of': 'Congo, Dem. Rep.',
    'Congo_Republic of the': 'Congo, Rep.',
    'East Timor': 'Timor-Leste',
    'Egypt': 'Egypt, Arab Rep.',
    'Gambia': 'Gambia, The',
    'Iran': 'Iran, Islamic Rep.',
    'Ivory Coast': 'Cote d\'Ivoire',
    'Korea_North': 'Korea, Dem. People’s Rep.',
    'Korea_South': 'Korea, Rep.',
    'Kyrgyzstan': 'Kyrgyz Republic',
    'Laos': 'Lao PDR',
    'Macedonia': 'Macedonia, FYR',
    'Palestine_West_Bank': 'West Bank and Gaza',
    'Russia': 'Russian Federation',
    'Slovakia': 'Slovak Republic',
    'Syria': 'Syrian Arab Republic',
    'Venezuela': 'Venezuela, RB',
    'Vietnam_Democratic Republic of': 'Vietnam',
    'Yemen': 'Yemen, Rep.'
}
print(VDem_data.query('country_name in @VDem_country_name_to_UN_country.keys()')['country_name'].sort_values().unique().tolist() == list(VDem_country_name_to_UN_country.keys()))
print(UN_data.query('country in @VDem_country_name_to_UN_country.values()')['country'].sort_values().unique().tolist() == sorted(list(VDem_country_name_to_UN_country.values())))
VDem_data['country_name'] = VDem_data['country_name'].replace(VDem_country_name_to_UN_country)
print(VDem_data.query('country_name in @VDem_country_name_to_UN_country.values()')['country_name'].sort_values().unique().tolist() == sorted(list(VDem_country_name_to_UN_country.values())))
UN_data_and_VDem_data = pd.merge(UN_data, VDem_data, left_on = 'country', right_on = 'country_name', how = 'inner', indicator = 'matched', validate = 'many_to_many')
print(UN_data_and_VDem_data['matched'].value_counts())
UN_data_and_VDem_data

True
True
True
both          281772
left_only          0
right_only         0
Name: matched, dtype: int64


Unnamed: 0,country,year_x,health_expenditure_per_capita_in_dollars,physicians_per_1000_people,percent_undernourished,X1,country_name,country_id,country_text_id,year_y,...,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh,matched
0,Afghanistan,1960,,0.034844,,1583,Afghanistan,36,AFG,1960,...,0.143550,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
1,Afghanistan,1960,,0.034844,,1584,Afghanistan,36,AFG,1961,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
2,Afghanistan,1960,,0.034844,,1585,Afghanistan,36,AFG,1962,...,0.160458,0.074516,0.162687,0.028557,0.181335,0.232855,0.129815,0.172381,0.301402,both
3,Afghanistan,1960,,0.034844,,1586,Afghanistan,36,AFG,1963,...,0.140015,0.111794,0.212807,0.050778,0.181335,0.232855,0.129815,0.172381,0.301402,both
4,Afghanistan,1960,,0.034844,,1587,Afghanistan,36,AFG,1964,...,0.229685,0.177830,0.304231,0.090927,0.174778,0.232559,0.116997,0.167323,0.300779,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
281767,Zimbabwe,2015,,,33.4,3035,Zimbabwe,62,ZWE,2010,...,0.432416,0.438434,0.582274,0.302371,0.559720,0.641812,0.477629,0.459267,0.623338,both
281768,Zimbabwe,2015,,,33.4,3036,Zimbabwe,62,ZWE,2011,...,0.432416,0.438434,0.582274,0.302371,0.559720,0.641812,0.477629,0.459267,0.623338,both
281769,Zimbabwe,2015,,,33.4,3037,Zimbabwe,62,ZWE,2012,...,0.432416,0.438434,0.582274,0.302371,0.559720,0.641812,0.477629,0.459267,0.623338,both
281770,Zimbabwe,2015,,,33.4,3038,Zimbabwe,62,ZWE,2013,...,0.304144,0.443507,0.601263,0.294328,,,,0.508582,0.665112,both


## Problem 2
[Kickstarter](https://www.kickstarter.com/) is a website on which people can pledge financial support for creative projects. Patrons are charged only if a project raises enough money to meet a pre-specified goal, and projects can offer items as "rewards" for patrons who contribute at particular levels. One interesting aspect of Kickstarter is the ability to [search projects by "ending soon"](https://www.kickstarter.com/discover/advanced?sort=end_date). If you have a few dollars to spare and want to feel like a hero, you can swoop in at the last minute to contribute enough for a project to meet its goal.

Cathie So created a project on Kaggle in which she [scraped Kickstarter](https://www.kaggle.com/socathie/kickstarter-project-statistics/data?select=live.csv) and collected data on 4000 live projects (projects that were currently collecting pledges from patrons) as of October 10, 2016, at 5pm Pacific time. The data are here:

In [17]:
kickstarter = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/live.csv", index_col = 0)
kickstarter = kickstarter.sort_values(by = 'end.time')
kickstarter = kickstarter.reset_index(drop = True)
kickstarter

Unnamed: 0,amt.pledged,blurb,by,country,currency,end.time,location,percentage.funded,state,title,type,url
0,2530.0,\nHey friends! I'm trying to get back on the r...,Kendra Connally,US,usd,2016-10-29T19:40:28-04:00,"Boise, ID",33,ID,Sun Comes From The Mountains,Town,/projects/120686504/sun-comes-from-the-mountai...
1,1.0,\nThe place where chicken meets liquor for the...,Cherise Cox,US,usd,2016-10-29T19:43:54-04:00,"Manhattan, NY",0,NY,Drunken Wings,County,/projects/1397064906/drunken-wings?ref=discovery
2,115.0,\nSpinsOut uses Spin-trifical force to get all...,SpinsOut - Bottle Spinner,US,usd,2016-10-29T20:02:14-04:00,"Stillwater, MN",2,MN,A simple to use kitchen tool.,Town,/projects/1618978951/a-simple-to-use-kitchen-t...
3,1.0,\nI would like to take these children's books ...,Brett Droege,US,usd,2016-10-29T20:23:37-04:00,"New York, NY",0,NY,Adventures of Kratos Danger,Town,/projects/671720192/adventures-of-kratos-dange...
4,7048.0,\nA game like no other. Great entertainment f...,TESENT Games,US,usd,2016-10-29T20:46:23-04:00,"Idaho Falls, ID",48,ID,"Slydysk ""Playing the Game of Icebocce""",Town,/projects/icebocce/slydysk-playing-the-game-of...
...,...,...,...,...,...,...,...,...,...,...,...,...
3995,328.0,\nReal Indian Chai Premix with added low sugar...,Khan Luxury,CZ,eur,2016-12-28T09:59:39-05:00,"Prague, Czech Republic",1,Prague,Khan Luxury Premix Tea,Town,/projects/864511489/khan-luxury-premix-tea?ref...
3996,0.0,\nA kind of Music Tale is a 360° video documen...,Maud Watel Kazak,FR,eur,2016-12-28T12:06:50-05:00,"France, France",0,Auvergne,A KIND OF MUSIC TALE,Town,/projects/711646952/a-kind-of-music-tale?ref=d...
3997,4701.0,"\nLil' Bunny Sue Roux is part cat, bunny, kang...",Golden Bell Studios,US,usd,2016-12-28T12:59:22-05:00,"New Orleans, LA",47,LA,Lil' Bunny Sue Roux on Two: The Adopted Two Le...,Town,/projects/1561265809/lil-bunny-sue-roux-on-two...
3998,100.0,\nAnnual CIMSEC Outreach in international mari...,CIMSEC Treasurer,US,usd,2016-12-28T13:32:05-05:00,"Washington, DC",13,DC,CIMSEC Outreach 2016-2017,Town,/projects/79477252/cimsec-outreach-2016-2017?r...


### Part a
Notice that the `end.time` column, the date and time at which the project stops accepting pledges, is formatted as follows:
```
2016-11-01T23:59:00-04:00
```
This format is "YYYY-MM-DDThh:mm:ss-TZD": four digits for the year, a dash, two digits for the month, another dash, and two digits for the day; a "T" that separates the date from the time; two digits each for the hour, minute, and second, separated by colons; and the time zone, expressed as the offset in hours and minutes from Greenwich Mean Time (also called UTC). For example, -04:00 means the local time is four hours behind UTC.
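As a quick illustration of the format described above, the following sketch parses the single example string both with and without `utc=True`. The variable names (`raw`, `local_ts`, `utc_ts`) are made up for this example and are not part of the assignment.

```python
import pandas as pd

# The example string from the text: 23:59:00 local time, offset -04:00 from UTC.
raw = "2016-11-01T23:59:00-04:00"

# Without utc=True, the timestamp keeps its original -04:00 offset.
local_ts = pd.Timestamp(raw)

# With utc=True, the timestamp is converted to UTC, which shifts the
# clock time forward by four hours (and here rolls over to the next day).
utc_ts = pd.to_datetime(raw, utc=True)

print(local_ts)
print(utc_ts)

# Both objects represent the same instant in time.
print(local_ts == utc_ts)
```

Note that converting to UTC can change the calendar day, which matters later when projects are grouped by ending day.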

However, `end.time` is currently read as a string, with the `object` data type:

In [18]:
kickstarter.dtypes

amt.pledged          float64
blurb                 object
by                    object
country               object
currency              object
end.time              object
location              object
percentage.funded      int64
state                 object
title                 object
type                  object
url                   object
dtype: object

Convert `end.time` to a timestamp, and extract the month, day, year, hour, minute, and second of the end time. To allow the `pd.to_datetime()` function to read timezones, use the `utc=True` argument. (2 points)

In [19]:
kickstarter['end.time'] = pd.to_datetime(kickstarter['end.time'], utc = True)
kickstarter['year'] = [x.year for x in kickstarter['end.time']]
kickstarter['month'] = [x.month for x in kickstarter['end.time']]
kickstarter['day'] = [x.day for x in kickstarter['end.time']]
kickstarter['hour'] = [x.hour for x in kickstarter['end.time']]
kickstarter['minute'] = [x.minute for x in kickstarter['end.time']]
kickstarter['second'] = [x.second for x in kickstarter['end.time']]
print(kickstarter.dtypes)
kickstarter

amt.pledged                      float64
blurb                             object
by                                object
country                           object
currency                          object
end.time             datetime64[ns, UTC]
location                          object
percentage.funded                  int64
state                             object
title                             object
type                              object
url                               object
year                               int64
month                              int64
day                                int64
hour                               int64
minute                             int64
second                             int64
dtype: object


Unnamed: 0,amt.pledged,blurb,by,country,currency,end.time,location,percentage.funded,state,title,type,url,year,month,day,hour,minute,second
0,2530.0,\nHey friends! I'm trying to get back on the r...,Kendra Connally,US,usd,2016-10-29 23:40:28+00:00,"Boise, ID",33,ID,Sun Comes From The Mountains,Town,/projects/120686504/sun-comes-from-the-mountai...,2016,10,29,23,40,28
1,1.0,\nThe place where chicken meets liquor for the...,Cherise Cox,US,usd,2016-10-29 23:43:54+00:00,"Manhattan, NY",0,NY,Drunken Wings,County,/projects/1397064906/drunken-wings?ref=discovery,2016,10,29,23,43,54
2,115.0,\nSpinsOut uses Spin-trifical force to get all...,SpinsOut - Bottle Spinner,US,usd,2016-10-30 00:02:14+00:00,"Stillwater, MN",2,MN,A simple to use kitchen tool.,Town,/projects/1618978951/a-simple-to-use-kitchen-t...,2016,10,30,0,2,14
3,1.0,\nI would like to take these children's books ...,Brett Droege,US,usd,2016-10-30 00:23:37+00:00,"New York, NY",0,NY,Adventures of Kratos Danger,Town,/projects/671720192/adventures-of-kratos-dange...,2016,10,30,0,23,37
4,7048.0,\nA game like no other. Great entertainment f...,TESENT Games,US,usd,2016-10-30 00:46:23+00:00,"Idaho Falls, ID",48,ID,"Slydysk ""Playing the Game of Icebocce""",Town,/projects/icebocce/slydysk-playing-the-game-of...,2016,10,30,0,46,23
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,328.0,\nReal Indian Chai Premix with added low sugar...,Khan Luxury,CZ,eur,2016-12-28 14:59:39+00:00,"Prague, Czech Republic",1,Prague,Khan Luxury Premix Tea,Town,/projects/864511489/khan-luxury-premix-tea?ref...,2016,12,28,14,59,39
3996,0.0,\nA kind of Music Tale is a 360° video documen...,Maud Watel Kazak,FR,eur,2016-12-28 17:06:50+00:00,"France, France",0,Auvergne,A KIND OF MUSIC TALE,Town,/projects/711646952/a-kind-of-music-tale?ref=d...,2016,12,28,17,6,50
3997,4701.0,"\nLil' Bunny Sue Roux is part cat, bunny, kang...",Golden Bell Studios,US,usd,2016-12-28 17:59:22+00:00,"New Orleans, LA",47,LA,Lil' Bunny Sue Roux on Two: The Adopted Two Le...,Town,/projects/1561265809/lil-bunny-sue-roux-on-two...,2016,12,28,17,59,22
3998,100.0,\nAnnual CIMSEC Outreach in international mari...,CIMSEC Treasurer,US,usd,2016-12-28 18:32:05+00:00,"Washington, DC",13,DC,CIMSEC Outreach 2016-2017,Town,/projects/79477252/cimsec-outreach-2016-2017?r...,2016,12,28,18,32,5
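The list comprehensions above work, but pandas also offers the `.dt` accessor, which extracts datetime components for a whole column at once. The sketch below is one possible alternative, shown on a small stand-in dataframe (`toy`) that reuses two `end.time` strings from the data. Depending on the pandas version, the resulting columns may be `int32` rather than `int64`.

```python
import pandas as pd

# Toy stand-in for the kickstarter data: two end times copied from the table above.
toy = pd.DataFrame({"end.time": ["2016-10-29T19:40:28-04:00",
                                 "2016-10-29T20:02:14-04:00"]})

# Convert the strings to timezone-aware timestamps in UTC.
toy["end.time"] = pd.to_datetime(toy["end.time"], utc=True)

# Pull each component from the .dt accessor instead of looping over rows.
for part in ["year", "month", "day", "hour", "minute", "second"]:
    toy[part] = getattr(toy["end.time"].dt, part)

print(toy)
```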


### Part b
Create a dataframe with one row for every ending day in the `kickstarter` data that reports the average amount pledged (`amt.pledged`) on each day. Sort the rows in descending order by average amount pledged, and display the five days with the highest averages. (2 points)

In [20]:
# Average amount pledged for each ending day. Selecting 'amt.pledged' before
# calling mean() avoids averaging the text columns, which raises an error in
# recent versions of pandas.
data_frame = kickstarter.groupby(['year', 'month', 'day'], as_index = False)['amt.pledged'].mean()
data_frame = data_frame.rename(columns = {'amt.pledged': 'average_amount_pledged'})
data_frame

Unnamed: 0,year,month,day,average_amount_pledged
0,2016,10,29,1265.500000
1,2016,10,30,10234.915254
2,2016,10,31,10301.485207
3,2016,11,1,10990.562092
4,2016,11,2,9810.437956
...,...,...,...,...
56,2016,12,24,9129.692308
57,2016,12,25,1424.272727
58,2016,12,26,36.857143
59,2016,12,27,360.285714


In [21]:
data_frame = data_frame.sort_values(by = 'average_amount_pledged', ascending = False)
data_frame = data_frame.reset_index(drop = True)
data_frame.head(n = 5)

Unnamed: 0,year,month,day,average_amount_pledged
0,2016,12,14,47938.375
1,2016,11,4,26975.388889
2,2016,11,11,24990.669065
3,2016,12,17,22160.230769
4,2016,11,18,21016.234043
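Another way to get one row per ending day is to group directly on the calendar date of the timestamp rather than on separate year, month, and day columns. The sketch below uses a small made-up dataframe (`toy`, with invented pledge amounts) only to show the pattern. It is one possible approach, not the required solution.

```python
import pandas as pd

# Hypothetical data: three projects ending on two different days (UTC).
toy = pd.DataFrame({
    "end.time": pd.to_datetime(
        ["2016-11-01 10:00", "2016-11-01 18:00", "2016-11-02 09:00"], utc=True),
    "amt.pledged": [100.0, 300.0, 50.0],
})

# Group by the calendar date of end.time, average the pledges,
# then sort from the highest daily average to the lowest.
daily = (toy.groupby(toy["end.time"].dt.date)["amt.pledged"]
            .mean()
            .rename("average_amount_pledged")
            .sort_values(ascending=False)
            .reset_index())

print(daily.head(5))
```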


### Part c
Display the text of the longest `blurb` in the data. (2 points)

In [22]:
index = kickstarter.blurb.str.len().idxmax()
print(kickstarter['blurb'][index])

max_length = 0
longest_blurb = None
for _, row in kickstarter.iterrows():
    blurb = row['blurb']
    if len(blurb) > max_length:
        max_length = len(blurb)
        longest_blurb = blurb
print(max_length)
print(longest_blurb)


'Down' is an experimental short film exploring the concepts of loss, nostalgia, and childhood by revisiting a waterpark 15 years later.

137

'Down' is an experimental short film exploring the concepts of loss, nostalgia, and childhood by revisiting a waterpark 15 years later.



### Part d
How many blurbs for projects with end dates between November 15, 2016 and December 7, 2016 contain the phrase "science fiction"? [Hint: Don't forget to make this search case-insensitive and to sort the `kickstarter` dataframe by `end.time` before setting `end.time` as the index.] (2 points)

In [23]:
filtered_kickstarter = kickstarter.copy(deep = True)
filtered_kickstarter.index = filtered_kickstarter['end.time']
filtered_kickstarter = filtered_kickstarter['2016-11-15T00:00:00-00:00':'2016-12-08T00:00:00-00:00']
filtered_kickstarter

end.time,amt.pledged,blurb,by,country,currency,end.time,location,percentage.funded,state,title,type,url,year,month,day,hour,minute,second
2016-11-15 00:00:00+00:00,1205.0,"\nFor: Dice and/or TCG, Top loaders, Magic the...",Sam Zhao,GB,gbp,2016-11-15 00:00:00+00:00,"Edinburgh, UK",40,Scotland,World's First Lockable / Stackable Metal Deck ...,Town,/projects/373390096/worlds-first-lockable-stac...,2016,11,15,0,0,0
2016-11-15 00:00:00+00:00,9220.0,\nWelcome To Dark Skies 1942 an Alternate WWII...,RESIN HORSE Games,ES,eur,2016-11-15 00:00:00+00:00,"Las Palmas, Spain",1844,Canary Islands,DARK SKIES 1942: a 15mm aerial wargame,Town,/projects/1768803006/dark-skies-1942-a-15mm-ae...,2016,11,15,0,0,0
2016-11-15 00:52:19+00:00,5.0,\nAcquérir une franchise Meltdown est un rêve ...,Costa,FR,eur,2016-11-15 00:52:19+00:00,"Avignon, France",0,Provence-Alpes-Cote d'Azur,Ouverture Bar Gaming,Town,/projects/301577903/ouverture-bar-gaming?ref=d...,2016,11,15,0,52,19
2016-11-15 00:58:15+00:00,675.0,\nFINISHED short comedy needs funds to send fi...,Root Beer Studios,US,usd,2016-11-15 00:58:15+00:00,"Providence, RI",135,RI,"Coffee at Night ""Kickfinisher""",Town,/projects/1111400501/coffee-at-night-kickfinis...,2016,11,15,0,58,15
2016-11-15 01:00:16+00:00,6197.0,\nDid Hitler escape Nazi Germany? Our Allied ...,Eli Nitz for my daughter Hayley Nitz,US,usd,2016-11-15 01:00:16+00:00,"Olathe, KS",62,KS,Chasing Hitler: Issues 2-4,Town,/projects/1034181012/chasing-hitler-issues-2-4...,2016,11,15,1,0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016-12-07 08:11:19+00:00,16347.0,\nControl and manage any ERNEST device for bot...,ERNEST,US,usd,2016-12-07 08:11:19+00:00,"Miami, FL",32,FL,ERNEST - SAFETY & CONVENIENCE FOR THOSE YOU LOVE!,Town,/projects/geternestapp/ernest-safety-and-conve...,2016,12,7,8,11,19
2016-12-07 10:59:38+00:00,250.0,\nStep into an immersive experience of compass...,Sciosity,AU,aud,2016-12-07 10:59:38+00:00,"Perth, AU",0,WA,In Terror's Wake: A Virtual Reality Documentary,Town,/projects/1424073590/in-terrors-wake-a-virtual...,2016,12,7,10,59,38
2016-12-07 19:19:43+00:00,1602.0,\nJoin Keyed-In 2 Christ (Emma & Kristina) as ...,Keyed-In 2 Christ,US,usd,2016-12-07 19:19:43+00:00,"Centreville, VA",20,VA,Keyed-In 2 Christ Debut Worship EP (short album),Town,/projects/813635968/keyed-in-2-christ-debut-wo...,2016,12,7,19,19,43
2016-12-07 20:05:20+00:00,158.0,"\nWe present only the multifunction wallets, m...",DA VINCI workshop,UA,usd,2016-12-07 20:05:20+00:00,"Zaporizhzhya, Ukraine",5,Zaporizhia Oblast,Handcrafted leather wallets,Town,/projects/1954752520/handcrafted-leather-walle...,2016,12,7,20,5,20


In [24]:
filtered_kickstarter['blurb'] = filtered_kickstarter['blurb'].str.lower()
mask = filtered_kickstarter['blurb'].str.contains('science fiction')
filtered_kickstarter[mask]

end.time,amt.pledged,blurb,by,country,currency,end.time,location,percentage.funded,state,title,type,url,year,month,day,hour,minute,second
2016-11-17 19:57:17+00:00,0.0,\nscience fiction action adventure comedy tech...,LaNard Morrison,US,usd,2016-11-17 19:57:17+00:00,"Houston, TX",0,TX,The Bulldogs (The bloodline),Town,/projects/314996571/the-bulldogs-the-bloodline...,2016,11,17,19,57,17
2016-11-18 06:15:00+00:00,875.0,\nthe exodus commission is a christian sci-fi ...,Virtual Exodus,US,usd,2016-11-18 06:15:00+00:00,"Riverside, CA",1,CA,The Exodus Commission - Christian Film & Graph...,Town,/projects/theexoduscommission/the-exodus-commi...,2016,11,18,6,15,0
2016-11-18 07:31:01+00:00,21364.0,\nspruitje makes futuristic designs with light...,Spruitje,NL,eur,2016-11-18 07:31:01+00:00,"Amsterdam, Netherlands",28,North Holland,Sustainable worlds,Town,/projects/1321031547/sustainable-worlds?ref=di...,2016,11,18,7,31,1
2016-11-29 01:00:00+00:00,5781.0,\nlegendary science fiction authors and the ma...,Randy Ritnour,US,usd,2016-11-29 01:00:00+00:00,"Lincoln, NE",165,NE,Kevin J. Anderson Presents Empire's Rift by St...,Town,/projects/takamo/kevin-j-anderson-presents-emp...,2016,11,29,1,0,0
2016-11-30 22:00:00+00:00,435.0,\nan anthology of science fiction and fantasy ...,Cheryl Morgan,GB,gbp,2016-11-30 22:00:00+00:00,"Trowbridge, UK",4,England,Piracity,Town,/projects/1914580668/piracity?ref=discovery,2016,11,30,22,0,0
2016-12-07 03:25:01+00:00,5299.0,\na science fiction film filled with entertain...,Chris,US,usd,2016-12-07 03:25:01+00:00,"Chicago, IL",529,IL,Ralphi3 The Movie,Town,/projects/257483623/ralphi3-the-movie?ref=disc...,2016,12,7,3,25,1


In [25]:
# Count only the rows whose blurb contains the phrase, not every row in the date window
len(filtered_kickstarter[mask])

6
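The same count can be obtained more compactly by slicing the sorted datetime index with date strings and passing `case=False` to `str.contains()`, which leaves the original blurb text unchanged. The sketch below assumes a small invented dataframe (`toy`) with made-up blurbs. With date-only strings, the slice covers all of December 7.

```python
import pandas as pd

# Hypothetical data: five projects, three of them inside the date window.
toy = pd.DataFrame({
    "end.time": pd.to_datetime(
        ["2016-11-14 12:00", "2016-11-20 08:00", "2016-12-01 09:30",
         "2016-12-07 23:00", "2016-12-09 00:00"], utc=True),
    "blurb": ["Science Fiction epic", "A SCIENCE FICTION anthology",
              "A cookbook", "Hard science fiction novella",
              "science fiction short"],
})

# Sort by end.time before making it the index, as the hint suggests.
toy = toy.sort_values("end.time").set_index("end.time")

# Partial-string slicing: both endpoints are inclusive at day resolution.
window = toy.loc["2016-11-15":"2016-12-07"]

# Case-insensitive search; na=False treats missing blurbs as non-matches.
mask = window["blurb"].str.contains("science fiction", case=False, na=False)

# True counts as 1, so the sum is the number of matching blurbs.
count = int(mask.sum())
print(count)
```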