***
# Population (Raw Data Processing)
Capstone Project - Ali Sehpar Shikoh
***

<b>Previous Notebook: ManureApplied-RAW</b>

<b>Next Notebook: RoundwoodProduction-RAW</b>

Human population can have both positive and negative effects on crop yield. On one hand, a larger population might increase the agricultural labor force; on the other hand, it might reduce agricultural land through its conversion to housing. This notebook processes the human population data reported for each country on a yearly basis.

### Exploratory Data Analysis

Importing the required packages.

In [1]:
import numpy as np
import pandas as pd

Importing the CSV file into a dataframe called 'population_df1'.

In [2]:
population_df1 = pd.read_csv("DataFiles/01-RawDataFiles/Population-RAW/Population-RAW.csv", encoding='latin1')

Looking at the imported dataset.

In [None]:
population_df1.head()

Unnamed: 0,Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag,Note
0,2,Afghanistan,3010,Population - Est. & Proj.,511,Total Population - Both sexes,1950,1950,1000 persons,7752.118,X,
1,2,Afghanistan,3010,Population - Est. & Proj.,511,Total Population - Both sexes,1951,1951,1000 persons,7840.156,X,
2,2,Afghanistan,3010,Population - Est. & Proj.,511,Total Population - Both sexes,1952,1952,1000 persons,7935.997,X,
3,2,Afghanistan,3010,Population - Est. & Proj.,511,Total Population - Both sexes,1953,1953,1000 persons,8039.694,X,
4,2,Afghanistan,3010,Population - Est. & Proj.,511,Total Population - Both sexes,1954,1954,1000 persons,8151.317,X,


Checking out the unique 'Element' entries.

In [None]:
population_df1['Element'].unique()

array(['Total Population - Both sexes', 'Total Population - Male',
       'Total Population - Female', 'Rural population',
       'Urban population'], dtype=object)

As seen in the preview above, the dataset has a total of 12 columns, and the 'Element' column holds five distinct categories.

Checking the distribution of entries in the 'Element' column.

In [None]:
population_df1['Element'].value_counts()

Total Population - Both sexes    39441
Total Population - Female        34674
Total Population - Male          34674
Rural population                 25811
Urban population                 25811
Name: Element, dtype: int64

As seen, the 'Element' column consists of multiple categories with different row counts. Only the 'Total Population - Both sexes' category will be used, as it is the combined figure that the other categories break down and it has the most entries.
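
The idea that the combined total covers the male and female categories can be spot-checked. The following is a minimal sketch on a small made-up frame (the `toy` name and all values are illustrative assumptions, not taken from the dataset): it pivots the categories into columns and compares the sums.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data shaped like the raw file; the values are invented.
toy = pd.DataFrame({
    "Area": ["A"] * 3 + ["B"] * 3,
    "Year": [2000] * 6,
    "Element": ["Total Population - Both sexes",
                "Total Population - Male",
                "Total Population - Female"] * 2,
    "Value": [100.0, 49.0, 51.0, 250.0, 126.5, 123.5],
})

# One row per (Area, Year), one column per 'Element' category.
wide = toy.pivot_table(index=["Area", "Year"], columns="Element", values="Value")

# Male + Female should match the combined total up to rounding.
parts = wide["Total Population - Male"] + wide["Total Population - Female"]
matches = np.isclose(parts, wide["Total Population - Both sexes"], atol=0.01)
print(matches.all())
```

On the real data, the same pivot would need to be restricted to rows where all three categories are present, since the categories have different row counts.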

Making sure that the data contained in the 'Unit' column has only one type of value.

In [None]:
population_df1['Unit'].unique()

array(['1000 persons'], dtype=object)

It turns out that the population dataset has only one unit type, i.e. thousands of persons (X 1000).
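
Because every value is expressed in thousands, a head count in persons is obtained by multiplying by 1000. A small sketch using the first two Afghanistan values from the preview above (the `sample` frame and the 'Persons' column name are illustrative assumptions):

```python
import pandas as pd

# Values are in thousands of persons, as reported in the 'Unit' column.
sample = pd.DataFrame({"Year": [1950, 1951], "Value": [7752.118, 7840.156]})

# Scale to persons and round to whole numbers to absorb floating-point noise.
sample["Persons"] = (sample["Value"] * 1000).round().astype("int64")
print(sample)
```

This notebook keeps the values in thousands and records the unit in the column name instead.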

Checking out unique entries in the 'Item' column.

In [None]:
population_df1['Item'].unique()

array(['Population - Est. & Proj.'], dtype=object)

The 'Item' column contains only one categorical value, so it carries no information that distinguishes rows.

Looking at the values contained in the 'Flag' column.

In [None]:
population_df1['Flag'].unique()

array(['X', 'A'], dtype=object)

Checking the distribution of the two flag types mentioned above.

In [None]:
population_df1['Flag'].value_counts(normalize=True)*100

X    84.380747
A    15.619253
Name: Flag, dtype: float64

It seems that about 84% of the values reported are official; the rest are either semi-official, estimated or calculated data. However, we are going to keep the data for both flag types in order to make the dataset as extensive as possible. Since rows with either flag are retained, the 'Flag' column is not needed downstream and can be discarded alongside the 'Item' column.

Checking out the value counts for the 'Note' column.

In [None]:
population_df1['Note'].value_counts()

Series([], Name: Note, dtype: int64)

The 'Note' column seems to be empty, so it can also be dropped.
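
The column-by-column checks above can also be done in one pass. A minimal sketch, assuming a hypothetical miniature frame `toy` with the same kind of columns (the values are invented): it lists columns that are entirely missing and columns with at most one distinct non-null value.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw frame; only the structure matters here.
toy = pd.DataFrame({
    "Item": ["Population - Est. & Proj."] * 4,
    "Unit": ["1000 persons"] * 4,
    "Flag": ["X", "X", "A", "X"],
    "Note": [np.nan] * 4,
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# Columns where every entry is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Columns holding at most one distinct non-null value (nunique ignores NaN).
constant_cols = toy.columns[toy.nunique() <= 1].tolist()

print(empty_cols)
print(constant_cols)
```

Columns flagged this way are candidates for dropping; the final decision still depends on whether they are needed later.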

### Refined Dataset Creation and Exportation

Dropping the redundant columns and renaming the retained 'Value' column to 'Total Population (X 1000)'.

In [None]:
# Pass the columns by keyword; recent pandas versions reject a positional axis argument.
population_df2 = population_df1.drop(columns=['Item', 'Item Code', 'Element Code', 'Year Code', 'Unit', 'Note', 'Flag'])
population_df2.rename(columns={'Value': 'Total Population (X 1000)'}, inplace=True)
population_df2

Unnamed: 0,Area Code,Area,Element,Year,Total Population (X 1000)
0,2,Afghanistan,Total Population - Both sexes,1950,7752.118
1,2,Afghanistan,Total Population - Both sexes,1951,7840.156
2,2,Afghanistan,Total Population - Both sexes,1952,7935.997
3,2,Afghanistan,Total Population - Both sexes,1953,8039.694
4,2,Afghanistan,Total Population - Both sexes,1954,8151.317
...,...,...,...,...,...
160406,5817,Net Food Importing Developing Countries,Urban population,2046,1383207.147
160407,5817,Net Food Importing Developing Countries,Urban population,2047,1418053.439
160408,5817,Net Food Importing Developing Countries,Urban population,2048,1453280.641
160409,5817,Net Food Importing Developing Countries,Urban population,2049,1488876.775


Since we are interested in the population of both sexes, we will filter the 'Element' column to keep only the 'Total Population - Both sexes' rows. We will also keep data for individual countries only, using area codes below 5000 to exclude regional and group aggregates.

In [None]:
population_df3 = population_df2.loc[(population_df2['Element'] == 'Total Population - Both sexes') & (population_df2['Area Code'] < 5000)]
population_df3

Unnamed: 0,Area Code,Area,Element,Year,Total Population (X 1000)
0,2,Afghanistan,Total Population - Both sexes,1950,7752.118
1,2,Afghanistan,Total Population - Both sexes,1951,7840.156
2,2,Afghanistan,Total Population - Both sexes,1952,7935.997
3,2,Afghanistan,Total Population - Both sexes,1953,8039.694
4,2,Afghanistan,Total Population - Both sexes,1954,8151.317
...,...,...,...,...,...
137842,181,Zimbabwe,Total Population - Both sexes,2096,30940.779
137843,181,Zimbabwe,Total Population - Both sexes,2097,30952.209
137844,181,Zimbabwe,Total Population - Both sexes,2098,30959.811
137845,181,Zimbabwe,Total Population - Both sexes,2099,30964.041


As seen, the number of columns is the same as before. However, the number of rows has decreased significantly.

Now we can drop the 'Element' column, as it contains only one value.
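
The effect of the two conditions can be seen more easily when each mask is named separately. A small sketch on a hypothetical frame `toy` (the area names and codes appear in the outputs above; the values are invented):

```python
import pandas as pd

# Hypothetical rows: one country and one aggregate, two 'Element' categories each.
toy = pd.DataFrame({
    "Area Code": [2, 2, 5817, 5817],
    "Area": ["Afghanistan", "Afghanistan",
             "Net Food Importing Developing Countries",
             "Net Food Importing Developing Countries"],
    "Element": ["Total Population - Both sexes", "Urban population"] * 2,
    "Value": [10.0, 3.0, 500.0, 200.0],
})

is_total = toy["Element"] == "Total Population - Both sexes"
is_country = toy["Area Code"] < 5000  # threshold used in this notebook

filtered = toy.loc[is_total & is_country]
print(len(toy), "->", len(filtered))
```

Printing the row counts before and after makes it easy to confirm how many rows each filter removes.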

In [None]:
population_df4 = population_df3.drop(columns=['Element'])
population_df4

Unnamed: 0,Area Code,Area,Year,Total Population (X 1000)
0,2,Afghanistan,1950,7752.118
1,2,Afghanistan,1951,7840.156
2,2,Afghanistan,1952,7935.997
3,2,Afghanistan,1953,8039.694
4,2,Afghanistan,1954,8151.317
...,...,...,...,...
137842,181,Zimbabwe,2096,30940.779
137843,181,Zimbabwe,2097,30952.209
137844,181,Zimbabwe,2098,30959.811
137845,181,Zimbabwe,2099,30964.041


Checking the dataset for null values and variable datatypes.

In [None]:
population_df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34349 entries, 0 to 137846
Data columns (total 4 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Area Code                  34349 non-null  int64  
 1   Area                       34349 non-null  object 
 2   Year                       34349 non-null  int64  
 3   Total Population (X 1000)  34349 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 1.3+ MB


As seen, there are no null values in the dataset: all 34349 rows are non-null in every column.

Resetting the index.
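
A more direct null check than reading the `info()` summary is to count missing values per column. A minimal sketch on a hypothetical frame `toy` with one deliberately missing value (all values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing population value.
toy = pd.DataFrame({
    "Area": ["A", "B", "C"],
    "Year": [1961, 1962, 1963],
    "Total Population (X 1000)": [10.0, np.nan, 30.0],
})

# Number of missing entries in each column.
null_counts = toy.isna().sum()
print(null_counts)

# Rows containing at least one missing entry, for inspection.
rows_with_nulls = toy[toy.isna().any(axis=1)]
print(rows_with_nulls)
```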

In [None]:
population_df4.reset_index(drop = True, inplace= True)
population_df4

Unnamed: 0,Area Code,Area,Year,Total Population (X 1000)
0,2,Afghanistan,1950,7752.118
1,2,Afghanistan,1951,7840.156
2,2,Afghanistan,1952,7935.997
3,2,Afghanistan,1953,8039.694
4,2,Afghanistan,1954,8151.317
...,...,...,...,...
34344,181,Zimbabwe,2096,30940.779
34345,181,Zimbabwe,2097,30952.209
34346,181,Zimbabwe,2098,30959.811
34347,181,Zimbabwe,2099,30964.041


Filtering the population dataset to the years 1961 through 2019 (inclusive), as the majority of the other datasets have data for this time period.

In [None]:
population_df5 = population_df4.loc[(population_df4['Year'] >= 1961) & (population_df4['Year'] <= 2019)]
population_df5

Unnamed: 0,Area Code,Area,Year,Total Population (X 1000)
11,2,Afghanistan,1961,9169.410
12,2,Afghanistan,1962,9351.441
13,2,Afghanistan,1963,9543.205
14,2,Afghanistan,1964,9744.781
15,2,Afghanistan,1965,9956.320
...,...,...,...,...
34263,181,Zimbabwe,2015,13814.629
34264,181,Zimbabwe,2016,14030.331
34265,181,Zimbabwe,2017,14236.595
34266,181,Zimbabwe,2018,14438.802


We are left with 12886 rows. The dataset looks clean and is ready to be exported to a CSV file so that it can be combined with the main processed dataset.
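
The same inclusive year filter can be written with `Series.between`, and the result can be checked for how many years each area covers. A minimal sketch on a hypothetical frame `toy` with two made-up areas; the 1961 to 2019 window spans 59 years, so areas with fewer years would stand out in the per-area counts.

```python
import pandas as pd

# Hypothetical frame: two made-up areas with yearly rows from 1950 to 2099.
years = list(range(1950, 2100))
toy = pd.DataFrame({
    "Area": ["A"] * len(years) + ["B"] * len(years),
    "Year": years * 2,
})

# between() includes both endpoints by default.
in_range = toy[toy["Year"].between(1961, 2019)]

# Number of distinct years available per area after filtering.
years_per_area = in_range.groupby("Area")["Year"].nunique()
print(years_per_area)
```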

Exporting the refined dataset to a folder containing refined/filtered data and working files.

In [None]:
population_df5.to_csv(r'DataFiles/02-RefinedDataFiles/Population-REFINED.csv', index = False)
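
A quick way to confirm that an export like this round-trips cleanly is to read the CSV back and compare it with the frame that was written. The sketch below is an assumption-laden illustration: it writes to an in-memory buffer rather than the real path, and the `refined` frame holds only two rows copied from the output above.

```python
import io

import pandas as pd

# Two rows copied from the filtered output above, for illustration only.
refined = pd.DataFrame({
    "Area Code": [2, 2],
    "Area": ["Afghanistan", "Afghanistan"],
    "Year": [1961, 1962],
    "Total Population (X 1000)": [9169.410, 9351.441],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
refined.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and compare; this raises AssertionError on any mismatch.
reloaded = pd.read_csv(buffer)
pd.testing.assert_frame_equal(refined, reloaded)
print(reloaded.shape)
```

Writing with `index=False` matters here: without it, the reloaded frame would gain an extra unnamed index column.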

### Summary of things done in this notebook:

- Performed basic EDA.
- Discarded region-based statistics by applying a filter on the 'Area Code' column.
- Filtered the population data to keep only the combined total for both sexes.
- Dropped redundant columns.
- Renamed the 'Value' column to the more descriptive 'Total Population (X 1000)'.
- Exported the refined data to a CSV file.
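
The steps listed above can be condensed into a single function. This is a sketch under stated assumptions rather than the notebook's actual code: the function name `refine_population`, its parameters, and the `toy_raw` frame are hypothetical, and the last two values in `toy_raw` are invented.

```python
import pandas as pd


def refine_population(raw, first_year=1961, last_year=2019):
    """Keep country-level combined totals within the given year window."""
    keep = (
        (raw["Element"] == "Total Population - Both sexes")
        & (raw["Area Code"] < 5000)  # area codes of 5000+ treated as aggregates
        & raw["Year"].between(first_year, last_year)
    )
    return (
        raw.loc[keep, ["Area Code", "Area", "Year", "Value"]]
        .rename(columns={"Value": "Total Population (X 1000)"})
        .reset_index(drop=True)
    )


# Hypothetical miniature of the raw data.
toy_raw = pd.DataFrame({
    "Area Code": [2, 2, 2, 5817],
    "Area": ["Afghanistan"] * 3 + ["Net Food Importing Developing Countries"],
    "Element": ["Total Population - Both sexes",
                "Total Population - Both sexes",
                "Urban population",
                "Total Population - Both sexes"],
    "Year": [1950, 1961, 1961, 1961],
    "Value": [7752.118, 9169.410, 1.0, 2.0],
})

refined = refine_population(toy_raw)
print(refined)
```

Selecting the needed columns by name, rather than dropping the unneeded ones, keeps the function working even if the raw file gains extra columns.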
