# 1. Introduction
### 1.1 Client Background and Problem Statement

### 1.2 Project Goal

# 2. Data Comprehension
### 2.1 Data Collection

### Data explanation
We started with exploring economic, and policy measures and found several that the team felt could give insights congruent to the needs of the client. Below is the indicator name defintion, and dataframe that we will read into the notebook for exploration and manipulation
#### Target Variables
1. **Gross Domestic Product per capita (df_gdp)**_*[$US/Capita]* - Monetary value of all of the good and services produced within a contries borders - seen as a key indicator of economic health
2. **Consumer Price Index (df_cpi)**_*[Index number] - An economic indicator that measures the average change in price for a basket of goods (essential household's goods and services) over time. it measures inflation
3. **Poverty Score (df_cpi)**_*[%Population]*

#### Features
Some of the measures below are united nation assessments **Country Policy and Institutional Assessment** which evaluate a country's policy and intitutional framework

- **CPIA - Business Regulatory Assessment (df_reg)**_*[Rating 1-6]* - A united nations rating that assesses how conducive a countries policies are for private sector development (e.g. Ease of operating a business, Regulatory framework, Property rights)
- **CPIA - Gender Equity (df_gender)**_*[Rating 1-6]* - rating that measures the extent to which a country's policies promote gender equity and empower women
- **CPIA - Social Inclusion (df_social)**_*[Rating 1-6]*
- **CPIA - Transparency Accountabilty and Corruption (df_tac)**_*[Rating 1-6]*
- **CPIA - Public Resource Equity (df_pre)**_*[Rating 1-6]*
- **Health expenditures (df_health)**_*[%US/GDP]*
- **Trade Exports (df_trade, df_pop)**_*[%US/Population]*
- **Trade Imports (df_trade, df_pop)**_*[%US/Population]*
- **Ease of Doing Business (df_edb)**_*[Rating 0-100]*
- **Income Distribution (df_inc2q, df_inc3q, df_inc4q, df_ind5q, df_incT10)**_*[%Population]*
- **Education expenditures (df_edu)**_*[$US/GDP]*
- **Secondary education enrollment (df_college)**_*[%Population]*

In [100]:
#Libraries for data collection, manipulation, and exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [101]:
#### Target Variables
df_gdp = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/GDP_per_cap_PPP.csv')
df_cpi = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/CPI.csv')
df_pov = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Poverty_Pct_Pop.csv')

#### CPIA
df_edb = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Ease_Doing_Business.csv')
df_reg = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Business_Regulation.csv')
df_gender = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Gender_Equity.csv')
df_pre = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Public_Resource_Equity.csv')
df_social = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Social_Inclusion.csv')
df_tac = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Transparency_Accountability_Corruption.csv')

#### Government Expenditures
df_health = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Health_Spend.csv')
df_edu = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Education_Spend.csv')

####  Financial
df_trade = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/Trade.csv')
df_inc2q = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/income_2nd_quintile.csv')
df_inc3q = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/income_3rd_quintile.csv')
df_inc4q = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/income_4th_quintile.csv')
df_inc5q = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/income_5th_quintile.csv')
df_incT10 = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/income_highest_10.csv')

#### Other
df_college = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/College_Enrollment.csv')
df_pop = pd.read_csv('https://raw.githubusercontent.com/te-ex153/Data/refs/heads/main/population2.csv')

### 2.1 EDA - Exploratory Data Analysis and Cleaning

In [102]:
'''
I know that i want to end up with a master dataframe with a index ordered by country then year
and columns of the various features mentioned above along with the three target variables
on the end of the table
Therefore for every df i will work to that end by
1. take an initial look at the data
2. address any missing values
3. strip white spaces 
4. drop any columns and rows that does not contribute to the study
5. reorganize standardize feature, level names, etc to merge seamlessly with other dfs moving forward
'''
##############  GDP Dataframe
nan_count_gdp = {
    'Colulmn': ['Country or Area', 'Year', 'Value'],
    'NoN Count': [
        df_gdp['Country or Area'].isnull().sum(), 
        df_gdp['Year'].isnull().sum(),
        df_gdp['Value'].isnull().sum()
    ]
}
nan_summary_gdp = pd.DataFrame(nan_count_gdp)
nan_summary_gdp   ######### NO missing data

df_gdp = df_gdp[['Country or Area', 'Year', 'Value']]  ########## Dropped a footnote column
df_gdp = df_gdp.applymap(lambda x: x.strip() if isinstance(x, str) else x)  ######### clean white space for all cells in table
df_gdp.columns = df_gdp.columns.str.strip()   ########## strip whitespace for the columns
df_gdp = df_gdp.rename(columns=str.lower) ######## make columns all lower case
df_gdp = df_gdp.rename(columns={'country or area': 'area', 'value': 'gdp'})    ############ change names to accomodate standardization moving forward
df_gdp['gdp'] = df_gdp['gdp'].round(2)  ################## round to two decimals
df_gdp['area'] = df_gdp['area'].astype(str)
df_gdp

Unnamed: 0,area,year,gdp
0,Afghanistan,2021,1673.96
1,Afghanistan,2020,2078.60
2,Afghanistan,2019,2168.13
3,Afghanistan,2018,2110.24
4,Afghanistan,2017,2096.09
...,...,...,...
7723,Zimbabwe,1994,1958.13
7724,Zimbabwe,1993,1765.45
7725,Zimbabwe,1992,1731.23
7726,Zimbabwe,1991,1907.65


In [103]:
import re
################# Repeat for CPI Data frome
'''
For cpi there are several columns that i dont need. i will only kep similar columns
to that of the gdp df. area year and value.   Also every other row shows 'CPI change' and
i dont need that in the initial df
'''
df_cpi = df_cpi.applymap(lambda x: x.strip() if isinstance(x, str) else x)  ###### strip whitespace
df_cpi.columns = df_cpi.columns.str.strip()

df_cpi = df_cpi[df_cpi['Description'].str.match(r'^\s*CPI\s*$', na=False)]  ####### drop unneeded rows
df_cpi = df_cpi[['Country or Area', 'Year', 'Value']]           ########## drop unneeded columns
df_cpi['Value'] = df_cpi['Value'].round(2)
df_cpi = df_cpi.rename(columns={'Country or Area': 'Area', 'Value': 'cpi'})
df_cpi = df_cpi.rename(columns=str.lower)
df_cpi['area'].unique

<bound method Series.unique of 1        ADVANCED ECONOMIES
3        ADVANCED ECONOMIES
5        ADVANCED ECONOMIES
7        ADVANCED ECONOMIES
9        ADVANCED ECONOMIES
                ...        
15277                 WORLD
15279                 WORLD
15281                 WORLD
15283                 WORLD
15284                 WORLD
Name: area, Length: 535, dtype: object>

# 3. Data Preparation
- Cleaning
- Transformation
- Feature engineering

# 4. Model
- Model Selection
- Training and testing
- Evaluation

# 5. Results
- Insights
- Reccomendations

# 6. Conclusion
- Recap
- Next Steps

# Appendix
- additional information visualizations, etc