<h1>Fundamentals of Data Science, Introduction week 2</h1>

**GOAL** In this notebook we are going to cover the following practical aspects of data science:

- Reading a CSV file and loading it into a DataFrame in Python using the pandas library
- Selecting the required columns of the DataFrame
- Summarising data by field, e.g. summing all the rows that correspond to a certain entry in the dataset
- Plotting the shape of the United States from geographic data, i.e. data containing the coordinates of each state
- Scaling and moving states using their coordinates
- Colouring the states based on the average age of their population

To complete this assignment you need to have a running Anaconda installation with Python 3.6 or 3.7 on your device. Python package prerequisites include:
+  **pandas**
+  **gdal**
+  **shapely**
+  **descartes**

To install these packages, type 'conda install {package}' into the terminal (Linux) or the Anaconda Prompt (Windows) for each package you don't have. GDAL can be a bit fussy; an alternative install method is described at the end of this notebook.

**Pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Pandas is well suited for many different kinds of data:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 
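To make the two structures concrete, here is a minimal sketch with made-up names and values (none of them come from the census data used later). It only illustrates that a Series is a labelled column of values and that a DataFrame is a table whose columns are Series sharing one index.

```python
import pandas as pd

# A Series is a labelled 1-dimensional array: values plus an index.
ages = pd.Series([34, 29, 41], index=['Ann', 'Bob', 'Cy'], name='age')

# A DataFrame is a 2-dimensional table; every column shares the same row index.
people = pd.DataFrame(
    {'age': [34, 29, 41], 'city': ['Oslo', 'Rome', 'Lima']},
    index=['Ann', 'Bob', 'Cy'],
)

print(ages['Bob'])              # look up one value by its label
print(people.loc['Cy', 'city']) # look up one cell by row and column label
print(people.shape)             # (number of rows, number of columns)
print(type(people['age']))      # a single column of a DataFrame is a Series
```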

In [None]:
# Import all the libraries required
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import rgb2hex
from descartes import PolygonPatch
from shapely.geometry import Polygon, MultiPolygon

<h3>A look at the data</h3>
Throughout this assignment we will use the US population statistics for 2010-2016. The dataset can be downloaded from https://www2.census.gov/programs-surveys/popest/datasets/2010-2016/state/asrh/.

As a first step, we will structure the data into a pandas DataFrame to simplify the data manipulation. We will start by loading the csv file of data into a DataFrame called population_data and we will then filter columns based on our use. *Take a look at the data and its fields*

In [None]:
population_data = pd.read_csv('sc-est2016-agesex-civ.csv')
# For viewing the complete dataset
population_data

In [None]:
population_data.shape

The complete dataset has 13,572 rows and 15 columns. You can verify this by looking at the shape of your dataframe with <i>population_data.shape</i>. To select specific columns, index the dataframe with a list of column names:

In [None]:
# Select only the columns ['NAME','SEX','AGE','POPEST2016_CIV']
population_data[['NAME','SEX','AGE','POPEST2016_CIV']]

In [None]:
# Keep only the rows for both sexes combined (SEX == 0) and drop the all-ages
# total rows (AGE == 999); store the result in a new dataframe 'population_data_all'
population_data_all = population_data[population_data['SEX']==0]
population_data_all = population_data_all[population_data_all['AGE']!=999]
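Why both conditions matter can be seen on a miniature table. The sketch below uses a hypothetical state and made-up numbers, and assumes the same coding as the filter above: SEX 0 for both sexes combined and AGE 999 for the all-ages total. If the total rows are left in, every person is counted twice when the rows are summed.

```python
import pandas as pd

# Hypothetical miniature of the census table (all numbers are made up).
toy = pd.DataFrame({
    'NAME': ['Utopia'] * 6,
    'SEX':  [0, 0, 0, 0, 1, 2],      # 0 = both sexes, 1 and 2 = one sex each
    'AGE':  [0, 1, 2, 999, 0, 0],    # 999 = total over all ages
    'POP':  [10, 20, 30, 60, 4, 6],
})

both_sexes = toy['SEX'] == 0
single_ages = toy['AGE'] != 999

# Combine the two boolean masks with & to apply both filters at once.
toy_all = toy[both_sexes & single_ages]

correct_total = toy_all['POP'].sum()
double_counted = toy.loc[both_sexes, 'POP'].sum()  # still contains the 999 row

print(correct_total, double_counted)
```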

In [None]:
# Sum the population of each state for each year on the dataset 'population_data_all'
population_sums = population_data_all.groupby(by=['NAME'], as_index=False)[['POPEST2010_CIV','POPEST2011_CIV','POPEST2012_CIV',
                                              'POPEST2013_CIV','POPEST2014_CIV','POPEST2015_CIV','POPEST2016_CIV']].sum()
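The same groupby pattern can also give a population-weighted average age per state, which is useful for the colouring goal listed at the top. The sketch below uses hypothetical states and made-up numbers; the helper column AGE_X_POP is our own name, not a field of the dataset. On the real file the result would only be an approximation if the top age code groups several ages together, so check the file layout before relying on it.

```python
import pandas as pd

# Hypothetical data: one row per (state, age) with a population count.
toy = pd.DataFrame({
    'NAME': ['Utopia', 'Utopia', 'Utopia', 'Arcadia', 'Arcadia'],
    'AGE':  [10, 20, 30, 10, 50],
    'POP':  [100, 200, 100, 100, 300],
})

# Total population per state: one output row per distinct NAME.
totals = toy.groupby('NAME')['POP'].sum()

# Weighted mean age = sum(age * population) / sum(population), per state.
toy['AGE_X_POP'] = toy['AGE'] * toy['POP']
grouped = toy.groupby('NAME')[['AGE_X_POP', 'POP']].sum()
average_age = grouped['AGE_X_POP'] / grouped['POP']

print(totals)
print(average_age)
```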

<h3> Get a US states geojson file </h3>        
Sources
- ArcGIS shapefile of US 50 states + DC (https://www.arcgis.com/home/item.html?id=f7f805eb65eb4ab787a0a3e1116ca7e5)
<h4> Recommended – convert to GeoJSON </h4>   
<i>ogr2ogr -f GeoJSON [name2].geojson [name1].shp</i> - where 'name1' is the name of the downloaded shp file and 'name2' is the name you want to give the GeoJSON file. Note that ogr2ogr expects the output file first and the input file second.

**NOTE:** If you are having trouble with this, or you are using Windows, you can also use https://mapshaper.org/ to convert between formats. When uploading to mapshaper, upload <u>all</u> the files from the states_21basic.zip file before converting; otherwise you will miss relevant features.
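Whichever converter you use, the result should be a GeoJSON FeatureCollection with one feature per state. The sketch below builds a tiny hypothetical collection in memory (the property name STATE_NAME and both shapes are assumptions for illustration) to show how the coordinates are nested: a Polygon holds a list of rings, while a MultiPolygon holds a list of polygons, each of which is again a list of rings.

```python
import json
from collections import Counter

# Hypothetical two-feature collection standing in for the converted file.
geojson_text = '''
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"STATE_NAME": "Squareland"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0,0],[1,0],[1,1],[0,1],[0,0]]]}},
  {"type": "Feature", "properties": {"STATE_NAME": "Twin Isles"},
   "geometry": {"type": "MultiPolygon",
                "coordinates": [[[[2,0],[3,0],[3,1],[2,1],[2,0]]],
                                [[[4,0],[5,0],[5,1],[4,1],[4,0]]]]}}
]}
'''
data = json.loads(geojson_text)

# How many features of each geometry type does the file contain?
type_counts = Counter(f['geometry']['type'] for f in data['features'])

# Polygon: coordinates[0] is the exterior ring (a list of [x, y] points).
square_ring = data['features'][0]['geometry']['coordinates'][0]

# MultiPolygon: coordinates is a list of polygons, one per separate part.
island_count = len(data['features'][1]['geometry']['coordinates'])

print(type_counts, len(square_ring), island_count)
```

Running the same counting step on your own converted file is a quick way to check that all states were read.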

In [None]:
S_DIR = r'C:\Users\A\Downloads' 
BLUE = '#5599ff'

with open(os.path.join(S_DIR, 'states.json')) as rf:    
    data = json.load(rf)

fig = plt.figure() 
ax = fig.gca()
for feature in data['features']:
    geometry = feature['geometry']
    if geometry['type'] == 'Polygon':
        poly = geometry
        ppatch = PolygonPatch(poly, fc=BLUE, ec=BLUE,  alpha=0.5, zorder=2)
        ax.add_patch(ppatch)
    elif geometry['type'] == 'MultiPolygon':
        # Each entry of 'coordinates' is one polygon: an exterior ring followed by any holes
        for rings in geometry['coordinates']:
            poly = Polygon(rings[0], rings[1:])
            ppatch = PolygonPatch(poly, fc=BLUE, ec=BLUE, alpha=0.5, zorder=2)
            ax.add_patch(ppatch)
    else:
        print('Don\'t know how to draw :', geometry['type'])

ax.axis('scaled')
plt.axis('off')
plt.show()
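Shapes like the ones drawn above can be resized and shifted before plotting. One option, sketched here on a made-up rectangle, is the shapely.affinity module, which is installed together with shapely. The scale factors and offsets below are arbitrary illustration values, not recommended settings for any particular state.

```python
from shapely import affinity
from shapely.geometry import Polygon

# A made-up 4 x 2 rectangle standing in for a state outline.
big = Polygon([(0, 0), (4, 0), (4, 2), (0, 2)])

# Shrink to half size in both directions, keeping the centre where it is.
small = affinity.scale(big, xfact=0.5, yfact=0.5, origin='center')

# Shift the shrunken shape 10 units right and 5 units down.
moved = affinity.translate(small, xoff=10, yoff=-5)

print(big.area, small.area)
print(moved.bounds)  # (minx, miny, maxx, maxy)
```

Both functions return new geometries and leave the original untouched, so the transformed shape can be passed to PolygonPatch in place of the original.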

**TASKS** <font color="red">
<h3> Improve the map </h3>
    
1) Try a different projection (example: US Census Bureau shapefile)
    
2) Resize and move Alaska 

3) Color the map based on the average age of each state for the year 2016

</font>
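As a starting point for task 3, the sketch below shows one possible way to turn a number per state into a fill colour. The state names and ages are made up, and the choice of the viridis colormap is only an assumption; any matplotlib colormap would work the same way.

```python
from matplotlib import cm
from matplotlib.colors import Normalize, rgb2hex

# Made-up average ages for three hypothetical states.
average_age = {'Utopia': 31.0, 'Arcadia': 38.5, 'Erewhon': 44.0}

# Map the observed range of values linearly onto [0, 1].
norm = Normalize(vmin=min(average_age.values()), vmax=max(average_age.values()))

# Look up each normalised value in the colormap and convert RGBA to a hex string.
colours = {name: rgb2hex(cm.viridis(norm(age))) for name, age in average_age.items()}

for name, colour in colours.items():
    print(name, colour)
```

Each hex string can then be passed as the fc argument of PolygonPatch for the matching state.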


<h3> Installations for GDAL </h3>

**IMPORTANT NOTE** Only follow these steps if 'conda install gdal' does not work

GDAL - Geospatial Data Abstraction Library <br>
    http://www.gdal.org/index.html

For example,

**On Fedora Linux:**
<ul>
<li><i>yum install libpng</i></li>
<li><i>yum install libtiff</i></li>
<li><i>sudo dnf install gdal gdal-devel</i></li>
</ul>

**On Ubuntu:**
<ul>
<li><i>sudo add-apt-repository ppa:ubuntugis/ppa && sudo apt-get update</i></li>
<li><i>sudo apt-get install gdal-bin</i></li>
</ul>

To verify the installation, run <i>ogrinfo</i>. If the installation was successful, you will see something like this:

<pre>
Usage: ogrinfo [--help-general] [-ro] [-q] [-where restricted_where]
               [-spat xmin ymin xmax ymax] [-fid fid]
               [-sql statement] [-al] [-so] [-fields={YES/NO}]
               [-geom={YES/NO/SUMMARY}][--formats]
               datasource_name [layer [layer ...]]
</pre>
        
**On Windows:**
         Refer to https://sandbox.idre.ucla.edu/sandbox/tutorials/installing-gdal-for-windows   