In [25]:
## Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Custom Functions for Analysis

In [26]:
def col_summary(df, col, dropna=False):
    
    """Takes in a Pandas DataFrame and specific column name. 
    Prints the number of unique values in the column (NaN excluded) and
    displays a DataFrame of value counts. Because `max_rows` is set to 10
    below, columns with more than 10 distinct values are shown truncated to
    the 5 most common and 5 least common values.
    Default is to also provide a count of NaN values.
    
    Args:
        df (DataFrame): DataFrame containing the column to summarize.
        col (str): Name of the column to be summarized.
        dropna (bool, default=False): Whether or not to drop null values.
    
    Example:
        >>> df = pd.DataFrame({'a': [2, 4, 4, 6],
                               'b': [2, 1, 3, 4]})
        >>> col_summary(df, col='a', dropna=False)
        
        ******************************
        Summary of a
        ******************************
        Total unique values: 3
        
            count 
        4   2  
        6   1
        2   1 
    """
    
    # Global display setting: frames longer than 10 rows are shown as head/tail
    pd.options.display.max_rows = 10
    
    print('***'*10)
    print(f"Summary of {col}")
    print('***'*10)
    print(f"Total unique values: {df[col].nunique()}")
    
    unique_vals = pd.DataFrame()
    unique_vals['count'] = pd.Series(df[col].value_counts(dropna=dropna))
    display(unique_vals)

# Obtain

In [27]:
## Read in data files
popn_df = pd.read_csv('share-of-population-urban.csv')
gdp_df = pd.read_csv('taxes-on-incomes-of-individuals-and-corporations-gdp.csv')

In [28]:
## Inspect first 5 rows of each df
display(popn_df.head())
gdp_df.head()

,Entity,Code,Year,Urban_Population
0,Afghanistan,AFG,1960,8.401
1,Afghanistan,AFG,1961,8.684
2,Afghanistan,AFG,1962,8.976
3,Afghanistan,AFG,1963,9.276
4,Afghanistan,AFG,1964,9.586


,Entity,Code,Year,Tax_Percent_GDP
0,Afghanistan,AFG,2003,0.165953
1,Afghanistan,AFG,2004,0.411647
2,Afghanistan,AFG,2005,0.320864
3,Afghanistan,AFG,2006,1.261181
4,Afghanistan,AFG,2007,1.323461


## Initial Exploration of Population Dataset

The urban population dataset has null values only in the `Code` column (2,668 of 15,072 rows). This likely won't be an issue: after further investigation, I plan to join the two DataFrames on `Entity` and `Year`.
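To see which entities account for the missing codes, a quick check along these lines may help. This is a minimal sketch on a toy frame: `toy_popn` and its rows are made up for illustration and only mimic the column layout of `popn_df`.

```python
import pandas as pd

# Toy stand-in for popn_df; these rows are illustrative, not taken from the real file
toy_popn = pd.DataFrame({
    'Entity': ['Afghanistan', 'OECD members', 'OECD members', '43hj43'],
    'Code': ['AFG', None, None, None],
    'Year': [1960, 1960, 1961, 1960],
    'Urban_Population': ['8.401', '62.0', '62.5', '88%'],
})

# Number of null values in each column
null_counts = toy_popn.isna().sum()

# Distinct entities that have at least one row with a missing Code
entities_missing_code = sorted(
    toy_popn.loc[toy_popn['Code'].isna(), 'Entity'].unique()
)

print(null_counts)
print(entities_missing_code)
```

Running the same two statements against `popn_df` would show whether the null codes are concentrated in aggregates and junk entries rather than in countries.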

The data types for each column make sense for the information they contain, except in the case of `Urban_Population`. Judging by the source file name (`share-of-population-urban.csv`) and by values such as `100` and `'88%'`, this column reports the urban share of the population as a percentage. It should therefore be a float, but it is an object dtype. It will need to be converted to a numeric data type to enable regression analysis of the effect of `Urban_Population` on `Tax_Percent_GDP`.
- Attempting to recast this column as type 'float64' produces the following error:
```ValueError: could not convert string to float: '88%'```
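One possible way to handle entries like `'88%'` is to strip the percent sign and coerce whatever still fails to parse into NaN for later review. The sketch below uses a made-up Series (`raw`) rather than the real column, and it assumes that `'88%'` is meant to be the value 88 on the same scale as the other entries.

```python
import pandas as pd

# Illustrative values only; 'abc' stands in for any entry that cannot be parsed
raw = pd.Series(['8.401', '88%', '100', 'abc'], name='Urban_Population')

# Normalize to strings, trim whitespace, and drop a trailing percent sign
cleaned = raw.astype(str).str.strip().str.rstrip('%')

# Convert to float; unparseable entries become NaN instead of raising
as_float = pd.to_numeric(cleaned, errors='coerce')

# Keep the original text of anything that failed to convert, for inspection
unparsed = raw[as_float.isna()]

print(as_float)
print(unparsed)
```

Inspecting `unparsed` before dropping or imputing anything keeps the cleaning step auditable.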

The `Entity` column of the population dataset includes some completely nonsensical values (e.g., "43hj43"), as well as values that are not countries (e.g., "OECD members" and "Upper middle income"). These values may not appear in the GDP dataset and would thus be dropped through an inner join on `Entity`. This will require further investigation.
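A set comparison is one way to preview which `Entity` values an inner join would drop. The frames below (`toy_popn`, `toy_gdp`) are small hypothetical stand-ins for `popn_df` and `gdp_df`, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; only the Entity column matters here
toy_popn = pd.DataFrame({'Entity': ['Afghanistan', 'OECD members', '43hj43', 'Japan']})
toy_gdp = pd.DataFrame({'Entity': ['Afghanistan', 'Japan', 'Kosovo']})

popn_entities = set(toy_popn['Entity'])
gdp_entities = set(toy_gdp['Entity'])

# Entities that would be lost from each side in an inner join on Entity
only_in_popn = sorted(popn_entities - gdp_entities)
only_in_gdp = sorted(gdp_entities - popn_entities)

print(only_in_popn)
print(only_in_gdp)
```

If every nonsensical or aggregate entry lands in `only_in_popn`, the join itself would filter them out without a separate cleaning step.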

In [5]:
## Metadata for urban population dataset
popn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15072 entries, 0 to 15071
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Entity            15072 non-null  object
 1   Code              12404 non-null  object
 2   Year              15072 non-null  int64 
 3   Urban_Population  15072 non-null  object
dtypes: int64(1), object(3)
memory usage: 471.1+ KB


In [32]:
## Investigate type and prevalence of unique values in each column of popn_df
for col in list(popn_df.columns):
    col_summary(popn_df, col)
    print('---'*5)
    print('\n\n')

******************************
Summary of Entity
******************************
Total unique values: 270


,count
OECD members,58
Upper middle income,58
Ecuador,58
Isle of Man,58
Venezuela,58
...,...
herger,1
43hj43,1
43hu,1
ho4u3h,1


---------------



******************************
Summary of Code
******************************
Total unique values: 215


,count
NaN,2668
SXM,58
HKG,58
DEU,58
MKD,58
...,...
GBR,58
DNK,58
ERI,52
SRB,28


---------------



******************************
Summary of Year
******************************
Total unique values: 58


,count
2003,261
2011,261
2004,261
2007,261
1995,261
...,...
1987,259
1965,259
1988,259
1964,259


---------------



******************************
Summary of Urban_Population
******************************
Total unique values: 13664


,count
100,468
21.2,10
83.1,10
79.8,8
90.4,7
...,...
37.106,1
32.813,1
44.734,1
31.008,1


---------------





## Initial Exploration of GDP Dataset

The GDP dataset has no null values and the data type for each column aligns with the type of information in the column.

Upon initial inspection, the `Entity` column in this GDP dataset appears cleaner than that in the population dataset.
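One way to test that impression is to check that each entity maps to exactly one code (and vice versa) and that no `Entity`/`Year` pair is repeated, since repeated keys would multiply rows in a later merge. This is a sketch on a toy frame: `toy_gdp` and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for gdp_df; the numbers are illustrative only
toy_gdp = pd.DataFrame({
    'Entity': ['Belgium', 'Belgium', 'Kosovo', 'Japan'],
    'Code': ['BEL', 'BEL', 'OWID_KOS', 'JPN'],
    'Year': [2000, 2001, 2000, 2000],
    'Tax_Percent_GDP': [14.2, 14.0, 3.1, 8.5],
})

# How many distinct codes each entity has, and how many entities share each code
codes_per_entity = toy_gdp.groupby('Entity')['Code'].nunique()
entities_per_code = toy_gdp.groupby('Code')['Entity'].nunique()
is_one_to_one = bool((codes_per_entity == 1).all() and (entities_per_code == 1).all())

# Rows whose (Entity, Year) key repeats an earlier row
duplicate_keys = int(toy_gdp.duplicated(subset=['Entity', 'Year']).sum())

print(is_one_to_one, duplicate_keys)
```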

In [33]:
## Metadata for GDP dataset
gdp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4871 entries, 0 to 4870
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Entity           4871 non-null   object 
 1   Code             4871 non-null   object 
 2   Year             4871 non-null   int64  
 3   Tax_Percent_GDP  4871 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 152.3+ KB


In [34]:
## Investigate type and prevalence of unique values in each column of gdp_df
for col in list(gdp_df.columns):
    col_summary(gdp_df, col)
    print('---'*5)
    print('\n\n')

******************************
Summary of Entity
******************************
Total unique values: 186


,count
Belgium,38
United Kingdom,38
Switzerland,38
Netherlands,38
Japan,38
...,...
Somalia,5
Kosovo,4
Iran,2
Bahamas,2


---------------



******************************
Summary of Code
******************************
Total unique values: 186


,count
CRI,38
FRA,38
GBR,38
SGP,38
DNK,38
...,...
SOM,5
OWID_KOS,4
IRN,2
BHS,2


---------------



******************************
Summary of Year
******************************
Total unique values: 38


,count
2002,169
2003,167
2001,166
2000,166
2004,166
...,...
1984,73
1983,71
1982,64
1980,61


---------------



******************************
Summary of Tax_Percent_GDP
******************************
Total unique values: 4853


,count
0.000000,16
5.176471,2
8.035714,2
2.760678,2
5.150947,1
...,...
12.528528,1
15.845956,1
12.751597,1
4.302662,1


---------------





## Merge the DataFrames