
#1. Import Data

**1a. CSV from URL**

**1b. Type Correction**

**1c. Set Index**

**Summary:**

* Name
* Count of Records
* Column Names

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1a. Read the global CSV straight from its raw GitHub URL
globalurl = 'https://raw.githubusercontent.com/ysh2272/learningstatistics/master/global.csv'
globaldf = pd.read_csv(globalurl, header=0)
# 1b. Cast each column to its intended type (astype returns a new Series)
yearglobal = globaldf['year'].astype(int)
tempglobal = globaldf['avg_temp'].astype(float)
# 1c. Use the year as the index
globaldf.set_index('year', inplace=True)

# Same three steps for the New York CSV
nyurl = 'https://raw.githubusercontent.com/ysh2272/learningstatistics/master/newyork.csv'
nydf = pd.read_csv(nyurl, header=0)
yearny = nydf['year'].astype(int)
tempny = nydf['avg_temp'].astype(float)
nydf.set_index('year', inplace=True)

print('globaldf has {} records of year, global'.format(len(globaldf)))
print('nydf has {} records of year, newyork'.format(len(nydf)))

globaldf has 266 records of year, global
nydf has 271 records of year, newyork
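
Steps 1a to 1c can also be applied to the DataFrame itself, so that the corrected types are kept in the frame and not only in separate variables. The cell below is a minimal sketch under stated assumptions: the inline CSV is a made-up stand-in for the real files, and only the column names `year` and `avg_temp` are taken from the cell above.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the CSV files; the values are invented.
sample_csv = io.StringIO("year,avg_temp\n1750,8.72\n1751,7.98\n1752,5.78\n")

# 1a. Read the CSV
df = pd.read_csv(sample_csv, header=0)

# 1b. astype returns a new object, so assign the result back to keep the cast
df = df.astype({'year': 'int64', 'avg_temp': 'float64'})

# 1c. Move year into the index
df = df.set_index('year')

# Summary: record count, index name, column names
summary = 'df has {} records, index {}, columns {}'.format(
    len(df), df.index.name, list(df.columns))
print(summary)
```

Passing a dictionary to `astype` casts several columns in one call, which keeps the type correction next to the import step.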


#2. Data Wrangling

**2a. Check Missing Records**

**2b. Check Duplicate Records**

**2c. Index Comparison**

**2d. Outer JOIN**

**2e. Assign Index**

**Summary:**

* Name of DataFrame
* Name and Range of Index
* Name of Columns 


In [0]:
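# Section 1 moved 'year' into the index; bring it back as a column and name
# each temperature column after its dataset (assumes the CSV column is
# 'avg_temp', as read in section 1)
nydf = nydf.reset_index().rename(columns={'avg_temp': 'newyork'})
worlddf = globaldf.reset_index().rename(columns={'avg_temp': 'world'})

newyork = nydf['year']
world = worlddf['year']

# 2a. Every year from first to last should be present
# (+1 because range() excludes its stop value)
nyyears = list(range(newyork.min(), newyork.max() + 1))
worldyears = list(range(world.min(), world.max() + 1))
worldmissing = set(worldyears) - set(world)
nymissing = set(nyyears) - set(newyork)
print('Missing Records ({} in worlddf, {} in nydf)'.format(len(worldmissing), len(nymissing)))

# 2b. Count repeated years
nyduplicate = newyork.duplicated()
worldduplicate = world.duplicated()
print('Duplicate Records ({} in worlddf, {} in nydf)'.format(worldduplicate.sum(), nyduplicate.sum()))

# 2c. Years covered by one dataset but not the other
ny_world = set(nyyears) - set(worldyears)
print('nydf has {} more years than worlddf: {}'.format(len(ny_world), ny_world))
world_ny = set(worldyears) - set(nyyears)
print('worlddf has {} more years than nydf: {}'.format(len(world_ny), world_ny))

# 2d. Outer join on year keeps the years of both datasets
outerdf = nydf.merge(worlddf, on='year', how='outer')
# 2e. Use year as the index of the joined frame
outerdf.set_index('year', inplace=True)

print('outerdf has {} records along year index for newyork, world'.format(len(outerdf)))

Missing Records (0 in worlddf, 0 in nydf)
Duplicate Records (0 in worlddf, 0 in nydf)
nydf has 7 more years than worlddf: {1743, 1744, 1745, 1746, 1747, 1748, 1749}
worlddf has 2 more years than nydf: {2014, 2015}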
outerdf has 273 records along year index for newyork, world
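
The index comparison in 2c and the outer join in 2d can be combined, because `merge` accepts `indicator=True` and then records which side each row came from. The sketch below uses two invented miniature frames; the column names `newyork` and `world` mirror the ones used in this notebook, and the values are illustrative only.

```python
import pandas as pd

# Made-up miniature frames standing in for nydf and worlddf
ny = pd.DataFrame({'year': [1748, 1749, 1750, 1751],
                   'newyork': [9.0, 9.1, 10.1, 10.8]})
world = pd.DataFrame({'year': [1750, 1751, 1752],
                      'world': [8.7, 8.0, 5.8]})

# 2a. Gaps inside the covered span; +1 because range() excludes its stop value
expected = set(range(ny['year'].min(), ny['year'].max() + 1))
missing = expected - set(ny['year'])

# 2d. indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
outer = ny.merge(world, on='year', how='outer', indicator=True).set_index('year')

# 2c. Years present in only one of the two frames
only_ny = outer[outer['_merge'] == 'left_only'].index.tolist()
only_world = outer[outer['_merge'] == 'right_only'].index.tolist()
print(missing, only_ny, only_world)
```

Reading the differences off the `_merge` column avoids building the year ranges by hand, so an off-by-one in `range()` cannot affect the comparison.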


#3. Missing Data

**3a. Null Index Location**

**3b. Null Type Checking**

**3c. Listwise Deletion**

**3d. Pairwise Deletion**

**Summary:**

* Count Records 
* Count NaN

In [0]:
newyork = outerdf['newyork']
world = outerdf['world']

# 3a/3b. Locate the NaNs in each column and report their type
nynull = newyork[newyork.isnull()]
nynullindex = nynull.index.to_list()
print('{} {} NaNs at {} in nydf'.format(len(nynull),type(nynull.iloc[0]),nynullindex))

worldnull = world[world.isnull()]
worldnullindex = worldnull.index.to_list()
print('{} {} NaNs at {} in worlddf'.format(len(worldnull),type(worldnull.iloc[0]),worldnullindex))

# 3c. Listwise deletion: drop every row (year) in which world is NaN
listwise = outerdf.drop(worldnullindex)
print('listwise deletion at {}'.format(worldnullindex))
world = listwise['world']

# 3d. Pairwise deletion: drop the last two years from newyork only
# (relies on the index being sorted by year)
pairwiseindex = listwise.index[-2:].to_list()
pairwise = listwise['newyork'].iloc[:-2]
print('pairwise deletion at {} in newyork'.format(pairwiseindex))
newyork = pairwise
newyorkindex = listwise['newyork'].index[:-2]

# Summary: record and NaN counts after both deletions
nynullindex = newyork[newyork.isnull()].index.to_list()
worldnullindex = world[world.isnull()].index.to_list()
print('{} records and {} NaN in newyork: {}'.format(len(newyork),len(nynullindex),nynullindex))
print('{} records and {} NaN in world'.format(len(world),len(worldnullindex)))

7 <class 'numpy.float64'> NaNs at [1746, 1747, 1748, 1749, 1780, 2014, 2015] in nydf
7 <class 'numpy.float64'> NaNs at [1743, 1744, 1745, 1746, 1747, 1748, 1749] in worlddf
listwise deletion at [1743, 1744, 1745, 1746, 1747, 1748, 1749]
pairwise deletion at [2014, 2015] in newyork
264 records and 1 NaN in newyork: [1780]
266 records and 0 NaN in world
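
For comparison, here is a minimal sketch of listwise and pairwise deletion in their usual textbook sense, using pandas built-ins. The frame is invented for illustration; only the column names `newyork` and `world` come from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in both columns
df = pd.DataFrame(
    {'newyork': [9.0, np.nan, 10.1, np.nan],
     'world': [np.nan, 8.0, 5.8, 8.5]},
    index=pd.Index([1749, 1750, 1751, 1752], name='year'))

# Listwise deletion: drop every row that has a NaN in any column
listwise = df.dropna(how='any')

# Pairwise deletion: keep all rows and let each statistic use
# whatever values are available for the column it involves
counts = df.count()              # non-null values per column
ny_mean = df['newyork'].mean()   # mean() skips NaN by default
world_mean = df['world'].mean()

print(listwise.index.tolist(), counts.to_dict())
```

Listwise deletion keeps the columns aligned at the cost of discarding rows, while pairwise deletion keeps more data but computes each statistic on a different set of years.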


#4. Exploratory Data Analysis

**4a. Data Visualization** 

**4b. Variation Analysis**

**4c. Outlier Analysis**

**Summary:**

* Outlier Locations
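
One common way to obtain outlier locations for step 4c is the 1.5 × IQR rule (Tukey's fences). The sketch below is an assumption about how that step could be done, not necessarily the method this notebook uses, and the temperature series is invented for illustration.

```python
import pandas as pd

# Made-up yearly temperatures with one obviously low value
temps = pd.Series(
    [8.7, 8.0, 7.8, 8.4, 8.5, 8.9, 8.3, 2.1],
    index=pd.Index(range(1750, 1758), name='year'),
    name='world')

# Interquartile range and the fences 1.5 * IQR beyond each quartile
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Outlier locations: the years whose value falls outside the fences
outliers = temps[(temps < lower) | (temps > upper)]
print(outliers.index.tolist())
```

The rule depends only on the quartiles, so a single extreme year does not shift the fences the way it would shift a mean and standard deviation.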
