# Project: Income Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Tip**: In this section of the report, provide a brief introduction to the dataset you've selected for analysis. At the end of this section, describe the questions that you plan on exploring over the course of the report. Try to build your report around the analysis of at least one dependent variable and three independent variables.
>
> If you haven't yet selected and downloaded your data, make sure you do that first before coming back here. If you're not sure what questions to ask right now, then make sure you familiarize yourself with the variables and the dataset context for ideas of what to explore.

Income per person is a measure of how much money an individual in a country earns on average. It is used as an indicator of the living conditions and quality of life in that country ([Investopedia](https://www.investopedia.com/terms/i/income-per-capita.asp)).
Many different variables can affect income in a given area. The data selected for this analysis is provided by [Gapminder World](https://www.gapminder.org/data/), and it is composed of four main indicators.
1. **Mean years at school**
   * The average number of years of school attended by men and women 25 years and older.
   

2. **Employment by sector**
   * This indicator covers three sectors: Agriculture, Industry and Service. For each sector and country, it gives the percentage of all employment that falls in that sector.


3. **Employment by status**
   * This indicator covers three categories: Family workers, Salaried workers and Self-employed workers. For each status and country, it gives the percentage of all employment that has that status.
   
   
4. **Income**
   * Income per person, which is calculated as $\frac{\text{Gross Domestic Product}}{\text{Country Population}}$
   
   
The data is provided in CSV format. However, the employment by sector and employment by status indicators are split into one file per category, and the mean years at school indicator is split into separate files for men and women.

In [1]:
# Use this cell to set up import statements for all of the packages that you
#   plan to use.

# Remember to include a 'magic word' so that your visualizations are plotted
#   inline with the notebook. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### General Properties

In [8]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.


Gapminder provides the data for each indicator in a separate file.

In [2]:
income = pd.read_csv('data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv')

men_mean_yrs_school = pd.read_csv('data/mean_years_in_school_men_25_years_and_older.csv')
women_mean_yrs_school = pd.read_csv('data/mean_years_in_school_women_25_years_and_older.csv')

agriculture = pd.read_csv('data/agriculture_workers_percent_of_employment.csv')
industry = pd.read_csv('data/industry_workers_percent_of_employment.csv')
service = pd.read_csv('data/service_workers_percent_of_employment.csv')

family = pd.read_csv('data/family_workers_percent_of_employment.csv')
self_employed = pd.read_csv('data/self_employed_percent_of_employment.csv')
salaried = pd.read_csv('data/salaried_workers_percent_of_non_agricultural_employment.csv')

The first step is to understand how the data is structured.

In [3]:
income.head()

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2031,2032,2033,2034,2035,2036,2037,2038,2039,2040
0,Afghanistan,603,603,603,603,603,603,603,603,603,...,2420,2470,2520,2580,2640,2700,2760,2820,2880,2940
1,Albania,667,667,667,667,667,668,668,668,668,...,18500,18900,19300,19700,20200,20600,21100,21500,22000,22500
2,Algeria,715,716,717,718,719,720,721,722,723,...,15600,15900,16300,16700,17000,17400,17800,18200,18600,19000
3,Andorra,1200,1200,1200,1200,1210,1210,1210,1210,1220,...,73200,74800,76400,78100,79900,81600,83400,85300,87200,89100
4,Angola,618,620,623,626,628,631,634,637,640,...,6270,6410,6550,6700,6850,7000,7150,7310,7470,7640


In [4]:
men_mean_yrs_school.head()

Unnamed: 0,country,1970,1971,1972,1973,1974,1975,1976,1977,1978,...,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009
0,Afghanistan,0.7,0.7,0.8,0.8,0.8,0.9,0.9,0.9,1.0,...,2.1,2.1,2.2,2.3,2.3,2.4,2.4,2.5,2.6,2.6
1,Albania,5.1,5.2,5.3,5.5,5.6,5.7,5.9,6.0,6.1,...,9.1,9.2,9.4,9.5,9.6,9.8,9.9,10.0,10.1,10.2
2,Algeria,0.9,0.9,1.0,1.1,1.1,1.2,1.2,1.3,1.4,...,3.7,3.8,3.9,4.0,4.1,4.3,4.4,4.5,4.6,4.7
3,Angola,1.4,1.5,1.5,1.6,1.7,1.7,1.8,1.9,2.0,...,4.0,4.1,4.3,4.4,4.5,4.6,4.7,4.9,5.0,5.1
4,Antigua and Barbuda,7.0,7.1,7.2,7.4,7.5,7.6,7.8,7.9,8.1,...,11.0,11.1,11.2,11.3,11.4,11.5,11.5,11.6,11.7,11.8


### Changing the shape

Each indicator comes in its own CSV file, with one row per country and one column per year. The number of year columns differs between indicators.

To make the DataFrames more consistent and easier to inspect, it is better to reshape them so that they look like this:

| country    | year    | value  |
| -----------|:-------:| -----:|
| Brazil     | 1900 | 0.7 |
| Brazil     | 1901 | 0.8 |
| Canada     | 1900 | 0.75 |
| Germany | 1900    | 0.9 |

Firstly, I will change the income DataFrame to check how it looks.

In [5]:
income.shape

(193, 242)

Searching a bit on Google, I found the pandas function `melt`, which I can use to reshape the data.

In [6]:
income = income.melt(id_vars=['country'], var_name='year', value_name='income')

In [7]:
income.head()

Unnamed: 0,country,year,income
0,Afghanistan,1800,603
1,Albania,1800,667
2,Algeria,1800,715
3,Andorra,1800,1200
4,Angola,1800,618


In [8]:
income.shape

(46513, 3)
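
The shapes before and after the reshape are consistent with how `melt` works: the wide table has 193 rows and 241 year columns (242 columns minus `country`), so the long table has 193 × 241 = 46513 rows. A minimal sketch on a made-up two-country table (the values and the name `long_df` are illustrative, not Gapminder data):

```python
import pandas as pd

# Toy wide table: one row per country, one column per year (made-up values).
wide = pd.DataFrame({
    "country": ["Brazil", "Canada"],
    "1900": [0.70, 0.75],
    "1901": [0.80, 0.85],
})

# melt keeps `country` as an identifier and stacks every other column
# into (year, value) pairs.
long_df = wide.melt(id_vars=["country"], var_name="year", value_name="value")

# Rows in the long table = rows in the wide table * number of year columns.
expected_rows = wide.shape[0] * (wide.shape[1] - 1)

print(long_df)
print(long_df.shape[0] == expected_rows)
```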

This is the layout I want for the DataFrames, so I will apply the same change to the other ones.

In [9]:
# Change the mean years in school data.
men_mean_yrs_school = men_mean_yrs_school.melt(id_vars=['country'], var_name='year', value_name='mean_years')
women_mean_yrs_school = women_mean_yrs_school.melt(id_vars=['country'], var_name='year', value_name='mean_years')

# Change the employment sector data.
agriculture = agriculture.melt(id_vars=['country'], var_name='year', value_name='agriculture_workers_perc')
industry = industry.melt(id_vars=['country'], var_name='year', value_name='industry_workers_perc')
service = service.melt(id_vars=['country'], var_name='year', value_name='service_workers_perc')

# Change the employment status data.
family = family.melt(id_vars=['country'], var_name='year', value_name='family_workers_perc')
salaried = salaried.melt(id_vars=['country'], var_name='year', value_name='salaried_workers_perc')
self_employed = self_employed.melt(id_vars=['country'], var_name='year', value_name='self_employed_workers_perc')
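
The `melt` calls above differ only in `value_name`. One possible way to reduce the repetition, sketched here on toy data, is a small helper applied over a dictionary. The helper name `to_long` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def to_long(df, value_name):
    """Reshape a wide country-by-year table into country/year/value rows."""
    return df.melt(id_vars=["country"], var_name="year", value_name=value_name)


# Toy stand-ins for two of the indicator tables (made-up values).
wide_frames = {
    "agriculture_workers_perc": pd.DataFrame(
        {"country": ["Brazil", "Canada"], "1970": [45.0, 8.0], "1971": [44.0, 7.5]}
    ),
    "industry_workers_perc": pd.DataFrame(
        {"country": ["Brazil", "Canada"], "1970": [18.0, 30.0], "1971": [19.0, 30.5]}
    ),
}

# Apply the same reshape to every table, naming the value column after its key.
long_frames = {name: to_long(df, name) for name, df in wide_frames.items()}

for name, df in long_frames.items():
    print(name, df.shape)
```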

In [10]:
men_mean_yrs_school.head()

Unnamed: 0,country,year,mean_years
0,Afghanistan,1970,0.7
1,Albania,1970,5.1
2,Algeria,1970,0.9
3,Angola,1970,1.4
4,Antigua and Barbuda,1970,7.0


In [11]:
agriculture.head()

Unnamed: 0,country,year,agriculture_workers_perc
0,Afghanistan,1970,
1,Albania,1970,
2,Algeria,1970,
3,Angola,1970,
4,Antigua and Barbuda,1970,


### Inspecting data

Even after this transformation, the data for each indicator is split across different DataFrames, so inspecting all nine of them for inconsistencies would be quite repetitive. The separate DataFrames for one indicator simply reflect that the indicator has several categories. For example, employment by sector has three different sectors.

I will inspect the data to guarantee that the categories of each indicator are consistent, and then merge the data to make the analysis more consistent.

Now that all the DataFrames are in the same format, I will proceed to inspect the data.
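
As a rough sketch of the merge step mentioned above, long-format frames can be joined on `country` and `year` with `pd.merge`. The toy frames and the choice between an outer and an inner join are assumptions for illustration; the notebook does not state which join it will use.

```python
import pandas as pd

# Toy long-format frames (made-up values).
income_toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Canada"],
    "year": ["1970", "1971", "1970"],
    "income": [4000, 4200, 18000],
})
school_toy = pd.DataFrame({
    "country": ["Brazil", "Canada", "Canada"],
    "year": ["1970", "1970", "1971"],
    "mean_years": [2.5, 9.0, 9.1],
})

# An outer join keeps every (country, year) pair found in either frame;
# pairs missing from one side get NaN in that side's column.
merged = pd.merge(income_toy, school_toy, on=["country", "year"], how="outer")

# An inner join keeps only the pairs present in both frames.
common = pd.merge(income_toy, school_toy, on=["country", "year"], how="inner")

print(merged)
print(common)
```

An outer join preserves all observations at the cost of introducing missing values, while an inner join yields a smaller but complete table.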

#### Income data

In [12]:
income.shape

(46513, 3)

In [13]:
income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46513 entries, 0 to 46512
Data columns (total 3 columns):
country    46513 non-null object
year       46513 non-null object
income     46513 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [15]:
income.duplicated().sum()

0

In [16]:
income.nunique()

country     193
year        241
income     2379
dtype: int64

In [19]:
income.isna().sum()

country    0
year       0
income     0
dtype: int64

As we can see above, the data for the income indicator is complete: there are no duplicated or missing values. The `income` column is stored as `int64`, which is appropriate for the analysis, while `year` is stored as `object`.
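
The `info()` output above lists `year` as `object`, because `melt` takes the year values from the column headers, which are strings. If numeric years are needed later (for filtering or plotting, for example), one option is to cast the column. A minimal sketch on toy data, with made-up values:

```python
import pandas as pd

# Toy long-format frame with years stored as strings, as they are after melt.
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Canada"],
    "year": ["1800", "1801", "1800"],
    "income": [603, 610, 1200],
})

# Cast the year column to integers so it can be compared and sorted numerically.
toy["year"] = toy["year"].astype(int)

print(toy.dtypes)
print(toy[toy["year"] > 1800])
```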

#### Mean years at school

In [21]:
men_mean_yrs_school.shape

(6960, 3)

In [22]:
women_mean_yrs_school.shape

(6960, 3)

In [23]:
men_mean_yrs_school.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6960 entries, 0 to 6959
Data columns (total 3 columns):
country       6960 non-null object
year          6960 non-null object
mean_years    6960 non-null float64
dtypes: float64(1), object(2)
memory usage: 163.2+ KB


In [None]:
women_mean_yrs_school.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6960 entries, 0 to 6959
Data columns (total 3 columns):
country       6960 non-null object
year          6960 non-null object
mean_years    6960 non-null float64
dtypes: float64(1), object(2)
memory usage: 163.2+ KB


In [32]:
men_mean_yrs_school.duplicated().sum()

0

In [33]:
women_mean_yrs_school.duplicated().sum()

0

In [34]:
men_mean_yrs_school.isna().sum()

country       0
year          0
mean_years    0
dtype: int64

In [38]:
women_mean_yrs_school.isna().sum()

country       0
year          0
mean_years    0
dtype: int64

In [39]:
# Check that both DataFrames cover the same set of countries.
set(men_mean_yrs_school.country.unique()) == set(women_mean_yrs_school.country.unique())

True

In [40]:
# Check that both DataFrames cover the same set of years.
set(men_mean_yrs_school.year.unique()) == set(women_mean_yrs_school.year.unique())

True

The `mean years in school` data for both genders is very consistent. It has no missing values or duplicates. Furthermore, both DataFrames have the same number of rows and the same countries and years, which shows that the men and women data cover the same period and regions.
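
One way to check that two frames cover the same countries is to compare sets of their values, which ignores order and repetition and tests membership directly. A small sketch with toy frames (the names and values are hypothetical):

```python
import pandas as pd

# Toy frames: same countries in a different order, plus one frame missing a country.
men_toy = pd.DataFrame({"country": ["Albania", "Brazil", "Canada"]})
women_toy = pd.DataFrame({"country": ["Canada", "Albania", "Brazil"]})
other_toy = pd.DataFrame({"country": ["Albania", "Brazil"]})

# Equal sets mean both frames contain exactly the same countries.
same = set(men_toy["country"]) == set(women_toy["country"])
different = set(men_toy["country"]) == set(other_toy["country"])

# Set difference shows which countries are missing from the smaller frame.
missing = set(men_toy["country"]) - set(other_toy["country"])

print(same, different, missing)
```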

#### Employment by sector

In [41]:
agriculture.shape

(8448, 3)

In [42]:
industry.shape

(8448, 3)

In [43]:
service.shape

(8448, 3)

In [44]:
agriculture.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8448 entries, 0 to 8447
Data columns (total 3 columns):
country                     8448 non-null object
year                        8448 non-null object
agriculture_workers_perc    3470 non-null float64
dtypes: float64(1), object(2)
memory usage: 198.1+ KB


In [45]:
industry.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8448 entries, 0 to 8447
Data columns (total 3 columns):
country                  8448 non-null object
year                     8448 non-null object
industry_workers_perc    3534 non-null float64
dtypes: float64(1), object(2)
memory usage: 198.1+ KB


In [46]:
service.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8448 entries, 0 to 8447
Data columns (total 3 columns):
country                 8448 non-null object
year                    8448 non-null object
service_workers_perc    3534 non-null float64
dtypes: float64(1), object(2)
memory usage: 198.1+ KB


In [48]:
agriculture.duplicated().sum()

0

In [49]:
industry.duplicated().sum()

0

In [50]:
service.duplicated().sum()

0

In [51]:
agriculture.isna().sum()

country                        0
year                           0
agriculture_workers_perc    4978
dtype: int64

In [52]:
industry.isna().sum()

country                     0
year                        0
industry_workers_perc    4914
dtype: int64

In [53]:
service.isna().sum()

country                    0
year                       0
service_workers_perc    4914
dtype: int64

In [56]:
agriculture.nunique()

country                      176
year                          48
agriculture_workers_perc    1295
dtype: int64

In [57]:
industry.nunique()

country                  176
year                      48
industry_workers_perc    495
dtype: int64

In [58]:
service.nunique()

country                 176
year                     48
service_workers_perc    665
dtype: int64

In [54]:
# Check that the three DataFrames cover the same set of countries.
set(agriculture.country.unique()) == set(industry.country.unique()) == set(service.country.unique())

True

In [55]:
# Check that the three DataFrames cover the same set of years.
set(agriculture.year.unique()) == set(industry.year.unique()) == set(service.year.unique())

True

As shown above, the data for the `employment by sector` indicator has missing values: 4978 of 8448 rows for agriculture, and 4914 of 8448 rows each for industry and service. However, the countries and years present in the three DataFrames are the same.
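
Expressed as shares, about 59% of the agriculture rows (4978 / 8448) and about 58% of the industry and service rows (4914 / 8448) are missing. In pandas, `isna().mean()` gives this fraction directly, and a per-year count shows where the data is concentrated. A sketch on toy data, with made-up values:

```python
import numpy as np
import pandas as pd

# Toy long-format frame with some missing values (made-up data).
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Canada", "Canada"],
    "year": ["1970", "1971", "1970", "1971"],
    "agriculture_workers_perc": [np.nan, 44.0, np.nan, np.nan],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Number of non-missing values per year, to see which years have data.
coverage_by_year = toy.groupby("year")["agriculture_workers_perc"].count()

print(missing_share)
print(coverage_by_year)
```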

#### Employment by status

In [60]:
family.shape

(8208, 3)

In [61]:
salaried.shape

(9940, 3)

In [62]:
self_employed.shape

(5728, 3)

In [63]:
family.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 3 columns):
country                8208 non-null object
year                   8208 non-null object
family_workers_perc    2473 non-null float64
dtypes: float64(1), object(2)
memory usage: 192.5+ KB


In [64]:
salaried.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9940 entries, 0 to 9939
Data columns (total 3 columns):
country                  9940 non-null object
year                     9940 non-null object
salaried_workers_perc    1760 non-null float64
dtypes: float64(1), object(2)
memory usage: 233.0+ KB


In [65]:
self_employed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 3 columns):
country                       5728 non-null object
year                          5728 non-null object
self_employed_workers_perc    5728 non-null float64
dtypes: float64(1), object(2)
memory usage: 134.3+ KB


In [66]:
family.duplicated().sum()

0

In [67]:
salaried.duplicated().sum()

0

In [68]:
self_employed.duplicated().sum()

0

In [69]:
family.isna().sum()

country                   0
year                      0
family_workers_perc    5735
dtype: int64

In [70]:
salaried.isna().sum()

country                     0
year                        0
salaried_workers_perc    8180
dtype: int64

In [71]:
self_employed.isna().sum()

country                       0
year                          0
self_employed_workers_perc    0
dtype: int64

In [72]:
family.nunique()

country                171
year                    48
family_workers_perc    896
dtype: int64

In [73]:
salaried.nunique()

country                  142
year                      70
salaried_workers_perc    461
dtype: int64

In [74]:
self_employed.nunique()

country                        179
year                            32
self_employed_workers_perc    1130
dtype: int64

In [54]:
# Check that the three DataFrames cover the same set of countries.
set(family.country.unique()) == set(salaried.country.unique()) == set(self_employed.country.unique())

False

In [75]:
# Check that the three DataFrames cover the same set of years.
set(family.year.unique()) == set(salaried.year.unique()) == set(self_employed.year.unique())

False

The data for `employment by status` is not as consistent as the other indicators. Each category covers a different number of countries (171 for `family`, 142 for `salaried` and 179 for `self_employed`) and years (48, 70 and 32, respectively). Also, there are null values in the `family` and `salaried` categories, but no category has duplicated rows.

> **Tip**: You should _not_ perform too many operations in each cell. Create cells freely to explore your data. One option that you can take with this project is to do a lot of explorations in an initial notebook. These don't have to be organized, but make sure you use enough comments to understand the purpose of each code cell. Then, after you're done with your analysis, create a duplicate notebook where you will trim the excess and organize your steps so that you have a flowing, cohesive report.

> **Tip**: Make sure that you keep your reader informed on the steps that you are taking in your investigation. Follow every code cell, or every set of related code cells, with a markdown cell to describe to the reader what was found in the preceding cell(s). Try to make it so that the reader can then understand what they will be seeing in the following cell(s).

### Data Cleaning (Replace this with more specific notes!)

In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work, you should save a copy of the report in HTML or PDF form via the **File** > **Download as** submenu. Before exporting your report, check over it to make sure that the flow of the report is complete. You should probably remove all of the "Tip" quotes like this one so that the presentation is as tidy as possible. Congratulations!