<a href="https://colab.research.google.com/github/ysh2272/learningstatistics/blob/master/weathertrends.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Import Data

a) CSV from URL

b) Type Correction

c) Set Index

**Summary:**

* Name
* Count of Records
* Column Names

In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

worldurl = 'https://raw.githubusercontent.com/ysh2272/learningstatistics/master/global.csv'
worlddf = pd.read_csv(worldurl, header=0)
# astype() returns a new Series, so assign the result back to the column
worlddf['year'] = worlddf['year'].astype(int)
worlddf['avg_temp'] = worlddf['avg_temp'].astype(float)

nyurl = 'https://raw.githubusercontent.com/ysh2272/learningstatistics/master/newyork.csv'
nydf = pd.read_csv(nyurl, header=0)
nydf['year'] = nydf['year'].astype(int)
nydf['avg_temp'] = nydf['avg_temp'].astype(float)

print('worlddf has {} records of avg_temp along year'.format(len(worlddf)))
print('nydf has {} records of avg_temp along year'.format(len(nydf)))

worlddf has 266 records of avg_temp along year
nydf has 271 records of avg_temp along year
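
The type correction and the index step (item c) can also be done together while loading. The following is a minimal sketch, assuming the same two-column layout (`year`, `avg_temp`) as the files above. The inline CSV text is illustrative sample data standing in for the real files so the example runs offline, and `load_temps` is a hypothetical helper name, not part of the notebook.

```python
import io
import pandas as pd

# Illustrative sample standing in for global.csv / newyork.csv (same column names).
SAMPLE_CSV = """year,avg_temp
1750,8.5
1751,8.0
1752,6.0
"""

def load_temps(source, name):
    """Read a year/avg_temp CSV, enforce dtypes, index by year, print a summary."""
    df = pd.read_csv(
        source,
        header=0,
        dtype={"year": "int64", "avg_temp": "float64"},  # type correction at read time
    )
    df = df.set_index("year")  # item c) Set Index
    print("{}: {} records, columns {}".format(name, len(df), list(df.columns)))
    return df

sample = load_temps(io.StringIO(SAMPLE_CSV), "sample")
```

Passing `worldurl` or `nyurl` instead of the `StringIO` object should work the same way, since `pd.read_csv` accepts either a URL or a file-like object.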


# 2. Data Wrangling

a) Check Missing Records

b) Check Duplicate Records

c) Index Comparison

d) Outer JOIN

**Summary:**

* Name of DataFrame
* Name and Range of Index
* Name of Columns 


In [40]:
newyork = nydf['year']
world = worlddf['year']
# range() excludes its stop value, so add 1 to include the final year
nyyears = list(range(newyork.min(), newyork.max() + 1))
worldyears = list(range(world.min(), world.max() + 1))
# a) years inside each range that have no record
worldmissing = set(worldyears) - set(world)
nymissing = set(nyyears) - set(newyork)
print('Missing Records ({} in worlddf, {} in nydf)'.format(len(worldmissing), len(nymissing)))

# b) years that appear more than once
nyduplicate = newyork.duplicated()
worldduplicate = world.duplicated()
print('Duplicate Records ({} in worlddf, {} in nydf)'.format(sum(worldduplicate), sum(nyduplicate)))

# c) years covered by one DataFrame but not the other
ny_world = set(nyyears) - set(worldyears)
print('nydf has {} more years than worlddf: {}'.format(len(ny_world), ny_world))
world_ny = set(worldyears) - set(nyyears)
print('worlddf has {} more years than nydf: {}'.format(len(world_ny), world_ny))

# d) outer join keeps every year found in either DataFrame
outerdf = nydf.merge(worlddf, on='year', how='outer')
outerdf.rename(columns={'avg_temp_x': 'newyork', 'avg_temp_y': 'world'}, inplace=True)
outerdf.set_index('year', inplace=True)
print(outerdf.head())

Missing Records (0 in worlddf, 0 in nydf)
Duplicate Records (0 in worlddf, 0 in nydf)
nydf has 7 more years than worlddf: {1743, 1744, 1745, 1746, 1747, 1748, 1749}
worlddf has 2 more years than nydf: {2014, 2015}
      newyork  world
year                
1743     3.26    NaN
1744    11.66    NaN
1745     1.13    NaN
1746      NaN    NaN
1747      NaN    NaN
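
The index comparison and the outer join can also be checked directly from the data rather than from generated ranges. The sketch below is one possible way to do that, using small made-up frames; only the overlap pattern matters. The `suffixes` and `indicator` arguments are standard `DataFrame.merge` parameters, while the frame contents and variable names are assumptions for illustration.

```python
import pandas as pd

# Small made-up frames; only the overlap pattern matters here.
ny = pd.DataFrame({"year": [1998, 1999, 2000], "avg_temp": [10.0, 10.5, 11.0]})
world = pd.DataFrame({"year": [1999, 2000, 2001], "avg_temp": [9.0, 9.2, 9.4]})

# Years present in one frame but not the other, taken from the actual records.
only_ny = sorted(set(ny["year"]) - set(world["year"]))
only_world = sorted(set(world["year"]) - set(ny["year"]))

# Gaps inside a frame's own range; the +1 keeps the last year in the range.
full_range = set(range(ny["year"].min(), ny["year"].max() + 1))
ny_gaps = sorted(full_range - set(ny["year"]))

# Outer join: suffixes name the value columns, indicator records each row's origin.
merged = ny.merge(
    world, on="year", how="outer",
    suffixes=("_newyork", "_world"), indicator=True,
).set_index("year")

print(only_ny, only_world, ny_gaps)
print(merged)
```

The `_merge` column shows `left_only`, `right_only`, or `both` for each year, which makes the source of every NaN in the joined frame explicit.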


# 3. Missing Data Handling

a) Null Index Location

b) Null Type Checking

c) Listwise Deletion

d) Pairwise Deletion

**Summary:**

* Count Records 
* Count and Location of NaNs

In [44]:
# a), b) locate the NaNs in each column and report their type
newyork = outerdf['newyork']
nynull = newyork[newyork.isnull()]
nynullindex = nynull.index.to_list()
print('{} {} NaNs at {} in newyork'.format(len(nynull),type(nynull.iloc[0]),nynullindex))

world = outerdf['world']
worldnull = world[world.isnull()]
worldnullindex = worldnull.index.to_list()
print('{} {} NaNs at {} in world'.format(len(worldnull),type(worldnull.iloc[0]),worldnullindex))

# c) listwise deletion: drop the whole rows for the years where world has no record
listwise = outerdf.drop(worldnullindex)
print('listwise deletion at {}'.format(worldnullindex))
world = listwise['world']

# d) pairwise deletion: newyork has no data for the last two years,
#    so trim them from newyork only; world keeps all of its records
pairwiseindex = listwise.index[-2:].to_list()
pairwise = listwise['newyork'].iloc[:-2]
print('pairwise deletion at {} in newyork'.format(pairwiseindex))
newyork = pairwise
newyorkindex = listwise['newyork'].index[:-2]

# summary: remaining record counts and NaN locations
nynullindex = newyork[newyork.isnull()].index.to_list()
worldnullindex = world[world.isnull()].index.to_list()
print('{} records and {} NaN in newyork: {}'.format(len(newyork),len(nynullindex),nynullindex))
print('{} records and {} NaN in world'.format(len(world),len(worldnullindex)))

7 <class 'numpy.float64'> NaNs at [1746, 1747, 1748, 1749, 1780, 2014, 2015] in newyork
7 <class 'numpy.float64'> NaNs at [1743, 1744, 1745, 1746, 1747, 1748, 1749] in world
listwise deletion at [1743, 1744, 1745, 1746, 1747, 1748, 1749]
pairwise deletion at [2014, 2015] in newyork
264 records and 1 NaN in newyork: [1780]
266 records and 0 NaN in world
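
The cell above removes rows by index labels and by position (the last two rows). As an alternative sketch, `dropna` can find the affected rows from the data itself. This assumes the common textbook reading of the terms: listwise deletion drops every row with a NaN in any column, and pairwise deletion lets each column keep all of its own non-null values. The values below are made up; only the NaN placement mimics the joined frame.

```python
import numpy as np
import pandas as pd

# Made-up values; the NaN placement mimics the outer-join result.
df = pd.DataFrame(
    {"newyork": [3.2, np.nan, 9.8, 10.1, np.nan],
     "world":   [np.nan, np.nan, 8.5, 8.7, 9.5]},
    index=pd.Index([1745, 1746, 1750, 1751, 2014], name="year"),
)

# Count and location of NaNs per column.
nan_years = {col: df.index[df[col].isna()].tolist() for col in df.columns}

# Listwise deletion: drop every row that has a NaN in any column.
listwise = df.dropna(how="any")

# Pairwise deletion: each column keeps all of its own non-null values.
pairwise = {col: df[col].dropna() for col in df.columns}

print(nan_years)
print(len(listwise), {col: len(s) for col, s in pairwise.items()})
```

Note that under this reading a year such as 1780 in the real data, where only `newyork` is NaN, would be removed by the listwise step, whereas the cell above keeps it for later imputation.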


# 4. Exploratory Data Analysis

a) Data Visualization

b) Variation Analysis

c) Outlier Analysis

d) Imputation

* Missing Data
* Outliers

**Summary:**

* Count and Location of Imputed Missing Data
* Count and Location of Imputed Outliers
* Distribution w/ Statistics
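
The outlier and imputation steps listed above could be sketched as follows. The source does not say which outlier rule or imputation method is used, so both choices here are assumptions: the 1.5 × IQR rule for flagging outliers and linear interpolation for filling gaps. The series is made-up data with one missing year and one implausible value.

```python
import numpy as np
import pandas as pd

# Made-up yearly temperatures with one gap (1780) and one implausible value (1782).
temps = pd.Series(
    [9.0, 9.2, np.nan, 9.4, 1.0, 9.6, 9.8],
    index=pd.Index(range(1778, 1785), name="year"),
    name="newyork",
)

# Assumed outlier rule: values more than 1.5 * IQR outside the quartiles.
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
iqr = q3 - q1
is_outlier = (temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)

# Count and location of the values that will be imputed.
missing_years = temps.index[temps.isna()].tolist()
outlier_years = temps.index[is_outlier].tolist()

# Assumed imputation: treat outliers as missing, then interpolate linearly.
imputed = temps.mask(is_outlier).interpolate(method="linear")

print("missing:", missing_years, "outliers:", outlier_years)
print(imputed.describe())  # distribution statistics after imputation
```

Linear interpolation suits a slowly varying yearly series, but it cannot fill a gap at the very start of the series, since there is no earlier value to interpolate from.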
