# Data Transformation
## Contents
1. Creating the Data
2. Transforming Data
3. Applying functions
4. Joining the tables
5. Correlation between columns


In [0]:
#Import pandas and suppress warning messages
import pandas as pd
import warnings
warnings.simplefilter('ignore')

## Creating the Data
* To understand the concepts we will work on a small dataset. This time we will create our own dataset.
* The simplest way to create a dataset is to use Python lists.
* Each inner list of the outer list represents one row, and the commas separate the values column by column.
* Here we create two datasets:
 * GDP of countries.
 * Life expectancy of countries.
* We will then answer questions by analysing the data.


In [2]:
#Creating GDP Dataset
table = [

['UK', 2678454886796.7], 

['USA', 16768100000000.0], 

['China', 9240270452047.0], 

['Brazil', 2245673032353.8],

['South Africa', 366057913367.1],

['India', 25000000000.345],

['Russia', 40575984903.67]

]
headings = ['country', 'GDP (US$)']
gdp = pd.DataFrame(columns=headings, data=table)
gdp

Unnamed: 0,country,GDP (US$)
0,UK,2678455000000.0
1,USA,16768100000000.0
2,China,9240270000000.0
3,Brazil,2245673000000.0
4,South Africa,366057900000.0
5,India,25000000000.0
6,Russia,40575980000.0


In [11]:
headings = ['Country name', 'Life expectancy (years)']

table = [

['China', 75],

['Russia', 71],

['United States of America', 79],

['India', 66],

['United Kingdom', 81],

['Brazil', 58],

['South Africa', 72]

]
life_expectancy = pd.DataFrame(columns=headings, data=table)
life_expectancy

Unnamed: 0,Country name,Life expectancy (years)
0,China,75
1,Russia,71
2,United States of America,79
3,India,66
4,United Kingdom,81
5,Brazil,58
6,South Africa,72


## Transforming Data
* We want to express the GDP amounts in millions, rounded to the nearest million, to make the data easier to read.
* Therefore, we create a Python function that <b>rounds</b> the amount to the nearest million.
* A function that has nothing to return can use <pre>return None</pre> or omit the return statement altogether.
* The built-in function <pre>round()</pre> rounds a number to the nearest integer. In Python 3, exact ties such as 2.5 are rounded to the nearest even integer.
* We add a second function that expands abbreviated country names, e.g. USA becomes United States of America.
* A third function converts an amount from US dollars to British pounds.
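The rounding and return behaviour described above can be checked with a short sketch. The helper name `log_value` is hypothetical and exists only for illustration; the GDP figure is the UK value from the table above.

```python
# Python 3's built-in round() rounds exact ties to the nearest even integer.
print(round(2.5))  # 2
print(round(3.5))  # 4
print(round(2.6))  # 3

# Express the UK GDP figure from the table above in whole millions.
uk_gdp_usd = 2678454886796.7
uk_gdp_millions = round(uk_gdp_usd / 1000000)
print(uk_gdp_millions)  # 2678455

# Hypothetical helper: a function without a return statement returns None.
def log_value(value):
    print("value:", value)

returned = log_value(1)
print(returned)  # None
```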

In [0]:
#User-defined function that converts an amount to the nearest million.
def roundToMillions(value):
  result = round(value / 1000000)
  return result

In [0]:
#Function that expands a country abbreviation to its full name.
def expandCountry(name):
  if name == "USA":
    return "United States of America"
  elif name == "UK":
    return "United Kingdom"
  else:
    return name

In [0]:
#Convert the currency from US$ to UK pounds.
def ustoGbp(usd):
  pounds = usd/1.564768 #average rate in 2013
  return pounds

## Applying functions
* Let's look at how we can use these functions on the columns of the dataset.
* We will use the method <pre>column.apply(<i>function_name</i>)</pre> to apply a user-defined function to every value in a particular column.
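As a minimal sketch of chaining two `apply` calls, the block below converts two GDP figures to pounds and then rounds them to millions. It redefines `ustoGbp` and `roundToMillions` as in this notebook so that it runs on its own; the small `demo` frame is an assumption made for illustration.

```python
import pandas as pd

def ustoGbp(usd):
    # Average 2013 exchange rate used in this notebook.
    return usd / 1.564768

def roundToMillions(value):
    return round(value / 1000000)

# Two rows taken from the GDP table, kept small for illustration.
demo = pd.DataFrame(
    {"country": ["UK", "USA"], "GDP (US$)": [2678454886796.7, 16768100000000.0]}
)

# apply() calls the function once per value and returns a new column,
# so the two conversions can be chained.
demo["GDP (pounds in Millions)"] = (
    demo["GDP (US$)"].apply(ustoGbp).apply(roundToMillions)
)
print(demo)
```

Because `apply` returns a new Series, the original `GDP (US$)` column is left unchanged.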

In [12]:
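column = gdp['country'] #Assign the column to a variable.
gdp['Country name'] = column.apply(expandCountry)
#Convert the GDP to pounds, then round to the nearest million.
column = gdp['GDP (US$)']
gdp['GDP (pounds in Millions)'] = column.apply(ustoGbp).apply(roundToMillions)
gdp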

Unnamed: 0,country,GDP (US$),Country name,GDP (pounds in Millions)
0,UK,2678455000000.0,United Kingdom,1711727
1,USA,16768100000000.0,United States of America,10716029
2,China,9240270000000.0,China,5905202
3,Brazil,2245673000000.0,Brazil,1435148
4,South Africa,366057900000.0,South Africa,233937
5,India,25000000000.0,India,15977
6,Russia,40575980000.0,Russia,25931


In [13]:
#Select only the new columns that we will work with.
gdp_in_pounds = gdp[['Country name','GDP (pounds in Millions)']]
gdp_in_pounds

Unnamed: 0,Country name,GDP (pounds in Millions)
0,United Kingdom,1711727
1,United States of America,10716029
2,China,5905202
3,Brazil,1435148
4,South Africa,233937
5,India,15977
6,Russia,25931


## Joining the tables
* Both tables have one column in common: <pre>Country name</pre>
* We <b>join</b> the two tables on the country name using <pre>pd.merge(<i>table_1, table_2, on=column_name, how=join_type</i>)</pre> where the join type is one of 'left', 'right', 'inner' or 'outer'.
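The effect of each join type can be seen on a toy example. The frames `left` and `right` and their values are made up purely for illustration and are not part of the GDP data.

```python
import pandas as pd

# Toy tables: keys B and C appear in both, A only on the left, D only on the right.
left = pd.DataFrame({"key": ["A", "B", "C"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "y": [20, 30, 40]})

keys_by_join = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(left, right, on="key", how=how)
    keys_by_join[how] = set(merged["key"])
    print(how, sorted(keys_by_join[how]), "rows:", len(merged))
```

A left join keeps every row of the first table, an inner join keeps only the keys found in both tables, and an outer join keeps every key from either table, filling the missing values with NaN.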


In [14]:
#As we want to keep all the rows of the left table (gdp_in_pounds), we use a left join.
pd.merge(gdp_in_pounds, life_expectancy, on='Country name', how='left')

Unnamed: 0,Country name,GDP (pounds in Millions),Life expectancy (years)
0,United Kingdom,1711727,81
1,United States of America,10716029,79
2,China,5905202,75
3,Brazil,1435148,58
4,South Africa,233937,72
5,India,15977,66
6,Russia,25931,71


In [0]:
# An inner join keeps only the countries present in both tables (how='outer' would keep every country from either table)
df = pd.merge(gdp_in_pounds, life_expectancy, on='Country name', how='inner')

## Correlation between columns
* Correlation measures the strength of the relationship between two columns using statistical methods.
* We use SciPy's spearmanr function to compute the Spearman rank correlation between GDP and life expectancy.
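To make the idea of a rank correlation concrete, here is a sketch that computes the Spearman coefficient by hand with pandas instead of SciPy. It assumes the seven value pairs from the joined table above (UK, USA, China, Brazil, South Africa, India, Russia) and relies on there being no tied values.

```python
import pandas as pd

# Values copied from the joined table, in the same row order.
gdp_millions = pd.Series([1711727, 10716029, 5905202, 1435148, 233937, 15977, 25931])
life_years = pd.Series([81, 79, 75, 58, 72, 66, 71])

# Spearman's coefficient is the Pearson correlation of the ranks.
gdp_rank = gdp_millions.rank()
life_rank = life_years.rank()
rho = gdp_rank.corr(life_rank)

# Without ties, the same value follows from the rank differences d:
# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
d = gdp_rank - life_rank
n = len(d)
rho_formula = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

print(rho, rho_formula)
```

Both values agree with the coefficient that spearmanr reports for this data.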

In [29]:
from scipy.stats import spearmanr
gdp_column = df['GDP (pounds in Millions)']
life_column = df['Life expectancy (years)']
(correlation, pvalue) = spearmanr(gdp_column, life_column)
print("Correlation is", correlation)
if pvalue < 0.05:
  print("Data is statistically significant")
else:
  print("Data is not statistically significant")

Correlation is 0.6785714285714287
Data is not statistically significant


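* The positive coefficient suggests that life expectancy tends to be higher in countries with higher GDP.
* However, the p-value is not below 0.05, so with only seven countries the result is not statistically significant and could be due to chance. Many factors other than GDP can also affect life expectancy.
* Therefore, quantitative analysis alone is not always enough; we also need qualitative analysis.
* A useful next step is to plot the data and inspect the relationship visually.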