# Week 3
This week you'll deepen your understanding of the Python pandas library by learning how to merge DataFrames, generate summary tables, group data into logical pieces, and manipulate dates. We'll also refresh your understanding of scales of data and discuss issues with creating metrics for analysis. The week ends with a more significant programming assignment.

### Learning Objectives
* Apply merge and join on DataFrames
* Employ slicing and indexing on DataFrames
* Analyze data with groupby and understand categorical variables
* Produce the entire process of data source to elucidation
* Examine the data by manipulating, cutting, and applying aggregate functions to DataFrames

In [63]:
# IDIOMS of PANDAS

import pandas as pd
import numpy as np

df = pd.DataFrame()

df['Product'] = pd.Series(['Book', 'Pencil', 'Eraser', 'Work book', 'Sketch'])
df['Quantity'] = pd.Series([1, 0, 3, 0, 1])
df['Price'] = pd.Series([12, 20, 0, 0, 20])
df['Old_Price'] = pd.Series([13, 10, 40, 30, 0])


x = df[df['Quantity'] == 0].index

df.drop(x, axis=0)


  Product  Quantity  Price  Old_Price
0    Book         1     12         13
2  Eraser         3      0         40
4  Sketch         1     20          0
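The same rows can also be selected with a boolean mask instead of collecting index labels and passing them to `drop`. This is a minimal sketch on the same toy data; the names `mask`, `kept`, and `dropped` are illustrative and not part of the course material.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Book', 'Pencil', 'Eraser', 'Work book', 'Sketch'],
    'Quantity': [1, 0, 3, 0, 1],
    'Price': [12, 20, 0, 0, 20],
    'Old_Price': [13, 10, 40, 30, 0],
})

# True for the rows to keep; indexing with the mask leaves df itself unchanged
mask = df['Quantity'] != 0
kept = df[mask]

# The drop-by-label approach, for comparison
dropped = df.drop(df[df['Quantity'] == 0].index, axis=0)

print(kept.equals(dropped))
print(list(kept.index))
```

Both approaches return a new DataFrame and keep the original index labels, so the choice is mostly a matter of readability.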


In [64]:
# "Pandorable" code: idioms of pandas
# Highly efficient, readable, problem-specific solutions
print((df
       .set_index('Quantity')
       .drop(labels=1, axis=0)
       .rename(columns={'Product':'Product ID'})
       .reset_index()
))

# The chain above makes Quantity the index, drops the rows labelled 1 (that is, rows where Quantity == 1),
# and then resets the index values. The statement below preserves the original index values.

# Target: remove rows where Quantity == 0 and rename the column 'Weight' to 'Weight (oz.)'.
# df has no 'Weight' column, so rename() leaves the column names unchanged here.
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))

# drop() removes rows by index label (axis=0, the default, means rows) and preserves the original index values.

   Quantity Product ID  Price  Old_Price
0         0     Pencil     20         10
1         3     Eraser      0         40
2         0  Work book      0         30
  Product  Quantity  Price  Old_Price
0    Book         1     12         13
2  Eraser         3      0         40
4  Sketch         1     20          0
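Filtering can also be written as one step of a method chain. This is a sketch, not the course's own solution: it assumes the goal is to remove rows with `Quantity == 0`, and because the toy frame has no `Weight` column it renames the existing `Product` column instead. The name `result` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Book', 'Pencil', 'Eraser', 'Work book', 'Sketch'],
    'Quantity': [1, 0, 3, 0, 1],
    'Price': [12, 20, 0, 0, 20],
    'Old_Price': [13, 10, 40, 30, 0],
})

result = (df
          # .loc accepts a callable, so the filter can sit inside the chain;
          # the original index labels are preserved
          .loc[lambda d: d['Quantity'] != 0]
          # rename a column that actually exists in this frame
          .rename(columns={'Product': 'Product ID'})
)
print(result)
```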


In [None]:
# 2nd idiom: apply and applymap

# Two methods: apply and applymap
# apply: available on both Series and DataFrames
# applymap: DataFrame only, works element-wise


def min_max(row):
    # Use only the numeric columns; including 'Product' would mix strings with
    # numbers and make the min/max comparison fail
    data = row[['Price', 'Old_Price']]

    return pd.Series({'min': np.min(data), 'max': np.max(data)})

df.apply(min_max, axis=1)
# With axis=1, each row is passed to the function in turn, so the function runs once per row.
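To make the difference between row-wise, column-wise, and element-wise application concrete, here is a small sketch on the two price columns. The variable names are illustrative. Element-wise application is shown with `Series.map` because, as far as I know, recent pandas releases deprecate `DataFrame.applymap` in favour of `DataFrame.map`; check the documentation for your installed version.

```python
import pandas as pd

prices = pd.DataFrame({'Price': [12, 20, 0, 0, 20],
                       'Old_Price': [13, 10, 40, 30, 0]})

# axis=1: the function receives one row (as a Series) at a time
row_range = prices.apply(lambda row: row.max() - row.min(), axis=1)

# axis=0 (the default): the function receives one column at a time
col_max = prices.apply(lambda col: col.max(), axis=0)

# Element-wise: Series.map calls the function on every single value
doubled = prices['Price'].map(lambda v: v * 2)

print(row_range.tolist())
print(col_max.to_dict())
print(doubled.tolist())
```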

## Assignment practice


In [12]:
import numpy as np
import pandas as pd

In [13]:
def answer_one():
    # Load the energy data, skipping the spreadsheet's header and footer rows
    energy = pd.read_excel('Energy Indicators.xls', skiprows=17, skipfooter=38)
    energy = energy[['Unnamed: 2', 'Petajoules', 'Gigajoules', '%']].copy()
    energy = energy.rename(columns={'Unnamed: 2': 'Country', 'Petajoules': 'Energy Supply', 'Gigajoules': 'Energy Supply per Capita', '%': '% Renewable'})
    # '...' marks missing values in the source file
    energy = energy.replace('...', np.nan)

    # Clean country names: remove footnote digits and parenthesised text, then trim whitespace
    energy['Country'] = (energy['Country']
                         .str.replace(r'\d+', '', regex=True)
                         .str.replace(r'\s*\(.*\)', '', regex=True)
                         .str.strip())
    # Exact-match renames, applied after cleaning so that 'Republic of Korea' cannot match
    # inside a longer name; 'Iran (Islamic Republic of)' has already become 'Iran' by this point
    energy['Country'] = energy['Country'].replace({
        'China, Hong Kong Special Administrative Region': 'Hong Kong',
        'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
        'Republic of Korea': 'South Korea',
        'United States of America': 'United States',
    })

    # Convert petajoules to gigajoules (1 PJ = 1,000,000 GJ)
    energy['Energy Supply'] = energy['Energy Supply'] * 1000000

    # GDP data: rename() would only relabel the index, so replace the values in the column
    GDP = pd.read_csv('world_bank.csv', skiprows=4)
    GDP['Country Name'] = GDP['Country Name'].replace({"Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"})

    # ScimEn data
    ScimEn = pd.read_excel('scimagojr-3.xlsx')

    return (energy, GDP, ScimEn)
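When harmonising country names across datasets, it helps to keep in mind that `DataFrame.rename` relabels index or column labels, while `Series.replace` maps cell values. The sketch below uses a made-up three-row frame (the numbers are invented, and `gdp`, `names`, `unchanged`, and `fixed` are illustrative names) to show the difference.

```python
import pandas as pd

# Toy stand-in for the GDP table; the figures are made up
gdp = pd.DataFrame({'Country Name': ['Korea, Rep.', 'Iran, Islamic Rep.', 'France'],
                    '2015': [1.0, 2.0, 3.0]})
names = {'Korea, Rep.': 'South Korea', 'Iran, Islamic Rep.': 'Iran'}

# rename() with a dict relabels the index; the index here is 0, 1, 2,
# so nothing matches and the cell values stay as they were
unchanged = gdp.rename(names)

# replace() on the column maps the cell values themselves
fixed = gdp.copy()
fixed['Country Name'] = fixed['Country Name'].replace(names)

print(unchanged['Country Name'].tolist())
print(fixed['Country Name'].tolist())
```

Note also that both methods return a new object by default, so the result has to be assigned to take effect.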

In [None]:
energy, GDP, ScimEn = answer_one()

# Join the three datasets: GDP, Energy, and ScimEn into a new dataset (using the intersection of country names). 
# Use only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank'
# (Rank 1 through 15).

# The index of this DataFrame should be the name of the country, and the columns should be 
# ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 
# 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', 
# '2013', '2014', '2015'].

req_cols = ['Country Name', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
GDP = GDP[req_cols]
# print(GDP.keys())

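req_cols = ['Country', 'Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index']
# Keep 'Rank' as a column (it is one of the required output columns) and filter on it explicitly
# rather than relying on the row order
ScimEn = ScimEn.loc[ScimEn['Rank'] <= 15, req_cols]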
# print(ScimEn.keys())

req_cols = ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
energy = energy[req_cols]
# print(energy.keys())

# Joining datasets
# pd.merge(left=ScimEn, right=GDP, left_on='Country', right_on='Country Name', how='inner')
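One way the three-way join could be completed is to chain two inner merges and then set the country as the index. The sketch below runs on tiny made-up stand-ins for the three tables (`scim`, `energy_toy`, and `gdp_toy` are hypothetical names with invented values), so it illustrates the shape of the operation rather than the assignment's actual result.

```python
import pandas as pd

# Toy stand-ins for the three datasets; all values are made up
scim = pd.DataFrame({'Country': ['China', 'United States', 'Japan'],
                     'Rank': [1, 2, 3]})
energy_toy = pd.DataFrame({'Country': ['China', 'United States', 'Brazil'],
                           'Energy Supply': [10.0, 20.0, 30.0]})
gdp_toy = pd.DataFrame({'Country Name': ['China', 'United States', 'Japan'],
                        '2015': [1.0, 2.0, 3.0]})

merged = (scim
          # inner join keeps only countries present in both tables
          .merge(energy_toy, on='Country', how='inner')
          # the GDP table uses a different key column name
          .merge(gdp_toy, left_on='Country', right_on='Country Name', how='inner')
          # the duplicate key column is no longer needed
          .drop(columns='Country Name')
          .set_index('Country'))

print(merged)
```

Japan is missing from the toy energy table and Brazil from the other two, so only China and the United States survive the intersection.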

