<a href="https://colab.research.google.com/github/swilsonmfc/pandas/blob/main/1_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analysis & Pandas

![](https://images-na.ssl-images-amazon.com/images/I/51cUNf8zukL._SX379_BO1,204,203,200_.jpg)

# About Me (Steve Wilson)
* Boise State University (Finance)
* Java Enterprise Developer @ Sight-n-Sound Software
* Management Roles @ FlightStats (CTO, CPO)
* Data Science 2013-2014
* FlightStats Acquired 2016 by FlightGlobal
* Data Scientist / Machine Learning Engineering @ Cirium
* Find me on Teams!

![](https://cf-images.us-east-1.prod.boltdns.net/v1/static/5615998029001/dbf897f7-4494-4e33-a280-fb920ebe59a4/85af29e6-725a-42d2-8cf6-e1343cb0f234/1280x720/match/image.jpg)

# Class Arrangement
* Introduction (Why Pandas)
  * Analytics in Pure Python
  * Hello Pandas
  * Analytics in Pandas
* Weekly
  * Work through a Dataset 
  * Homework Assignment
  * Solution Review 
* Gain skills:
  * Pandas & Numpy fundamentals
  * Work with different types of data
  * Transform data to answer questions
  * Visualizing & communicating
* Useful topics
  * Performance & Pandas
  * Time Series 
  * Geospatial 
  * Text processing
  * Regression
* What to Expect
  * Pythonic code (where it makes sense)
  * Deviate for explanatory purposes
  * An occasional dog barking

# Install

# Setup

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import csv
from pprint import pprint
from collections import defaultdict

In [9]:
pd.__version__

'1.1.5'

# Data
* We're going to work with the freely available simplemaps world city dataset
* Our goal for this dataset is to 
  * Obtain / download it
  * Read it in
  * Compute a grand total of population
  * Compute a by country total of population
* Compare / Contrast Pure Python vs Pandas
* If you're running the notebook on your own Jupyter Environment:
  * You may not have wget - substitute your own command (e.g. curl)
  * Download the file to your local folder
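
If neither wget nor curl is available, the download can be done from Python itself. This is a minimal sketch using only the standard library; `download_and_extract` is a hypothetical helper name, not part of the notebook, and some servers reject urllib's default user agent, so treat it as a starting point.

```python
import io
import urllib.request
import zipfile


def download_and_extract(url, dest_dir="."):
    """Download a zip archive from url, extract it into dest_dir, return the member names."""
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    # Wrap the downloaded bytes so zipfile can treat them like a file on disk
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage (assumes the URL from the wget cell is still live):
# download_and_extract(
#     "https://simplemaps.com/static/data/world-cities/basic/simplemaps_worldcities_basicv1.73.zip"
# )
```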


In [10]:
!wget https://simplemaps.com/static/data/world-cities/basic/simplemaps_worldcities_basicv1.73.zip
!unzip simplemaps_worldcities_basicv1.73.zip

--2021-03-19 11:21:33--  https://simplemaps.com/static/data/world-cities/basic/simplemaps_worldcities_basicv1.73.zip
Resolving simplemaps.com (simplemaps.com)... 104.26.13.95, 104.26.12.95, 172.67.71.113, ...
Connecting to simplemaps.com (simplemaps.com)|104.26.13.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2697031 (2.6M) [application/zip]
Saving to: ‘simplemaps_worldcities_basicv1.73.zip’


2021-03-19 11:21:33 (30.0 MB/s) - ‘simplemaps_worldcities_basicv1.73.zip’ saved [2697031/2697031]

Archive:  simplemaps_worldcities_basicv1.73.zip
  inflating: license.txt             
  inflating: worldcities.csv         
  inflating: worldcities.xlsx        


# Pure Python - Analytics

![](https://luciano.defalcoalfano.it/media/images/python-logo-generic.svg)

## Task
* Read in header & first 5 lines of the file
* Compute a total of the population
* Compute a group total of population by country
* Return the top 5 populous countries

## Read Data - File Reader
* Use the file interface in a context manager
* The file reader yields back a string per line
* We would need to parse each row to extract data 
  * Could use a regular expression
  * Could use split and index by position to extract
  * We have to handle quotes, delimiters and types

In [11]:
lines = 0

with open('worldcities.csv', 'r') as file:
  for line in file:
    print(line)
  
    lines += 1
    if lines > 5:
      break

"city","city_ascii","lat","lng","country","iso2","iso3","admin_name","capital","population","id"

"Tokyo","Tokyo","35.6897","139.6922","Japan","JP","JPN","Tōkyō","primary","37977000","1392685764"

"Jakarta","Jakarta","-6.2146","106.8451","Indonesia","ID","IDN","Jakarta","primary","34540000","1360771077"

"Delhi","Delhi","28.6600","77.2300","India","IN","IND","Delhi","admin","29617000","1356872604"

"Mumbai","Mumbai","18.9667","72.8333","India","IN","IND","Mahārāshtra","admin","23355000","1356226629"

"Manila","Manila","14.5958","120.9772","Philippines","PH","PHL","Manila","primary","23088000","1608618140"



In [12]:
extract = line.replace('"', '').split(',')
print(f'{extract[1]} = {int(extract[9]):,}')

Manila = 23,088,000
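
The split above works for Manila, but it breaks as soon as a quoted field contains a comma. The sketch below uses a made-up three-column row (only the country name "Korea, South" comes from the dataset) to compare naive splitting with the csv module.

```python
import csv

# Made-up row for illustration; the city and population values are placeholders
raw = '"Example City","Korea, South","100"'

# Naive approach: strip quotes, then split on every comma
naive = raw.replace('"', '').split(',')

# csv.reader accepts any iterable of lines and respects the quotes
parsed = next(csv.reader([raw]))

print(naive)   # 4 fields: the country was split in two
print(parsed)  # 3 fields: the quoted comma is preserved
```

With the naive split, the population is no longer at the position we expect, which is one reason to move on to a real CSV parser.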


In [13]:
# More pythonic - if you're that kind of person
with open('worldcities.csv', 'r') as file:
  lines = [next(file) for x in range(5)]
pprint(lines)

['"city","city_ascii","lat","lng","country","iso2","iso3","admin_name","capital","population","id"\n',
 '"Tokyo","Tokyo","35.6897","139.6922","Japan","JP","JPN","Tōkyō","primary","37977000","1392685764"\n',
 '"Jakarta","Jakarta","-6.2146","106.8451","Indonesia","ID","IDN","Jakarta","primary","34540000","1360771077"\n',
 '"Delhi","Delhi","28.6600","77.2300","India","IN","IND","Delhi","admin","29617000","1356872604"\n',
 '"Mumbai","Mumbai","18.9667","72.8333","India","IN","IND","Mahārāshtra","admin","23355000","1356226629"\n']


## Read Data - CSV Reader
* Expand on file reader by parsing the CSV in each row
* CSV Reader yields back a list of strings per line
* Improvement - We can index into the list by position to find population and country

In [14]:
lines = 0

with open('worldcities.csv', 'r') as file:
  reader = csv.reader(file, delimiter=',', quotechar='"')
  for line in reader:
    print(line)

    lines += 1
    if lines > 5:
      break

['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3', 'admin_name', 'capital', 'population', 'id']
['Tokyo', 'Tokyo', '35.6897', '139.6922', 'Japan', 'JP', 'JPN', 'Tōkyō', 'primary', '37977000', '1392685764']
['Jakarta', 'Jakarta', '-6.2146', '106.8451', 'Indonesia', 'ID', 'IDN', 'Jakarta', 'primary', '34540000', '1360771077']
['Delhi', 'Delhi', '28.6600', '77.2300', 'India', 'IN', 'IND', 'Delhi', 'admin', '29617000', '1356872604']
['Mumbai', 'Mumbai', '18.9667', '72.8333', 'India', 'IN', 'IND', 'Mahārāshtra', 'admin', '23355000', '1356226629']
['Manila', 'Manila', '14.5958', '120.9772', 'Philippines', 'PH', 'PHL', 'Manila', 'primary', '23088000', '1608618140']


In [15]:
extract = line
extract

['Manila',
 'Manila',
 '14.5958',
 '120.9772',
 'Philippines',
 'PH',
 'PHL',
 'Manila',
 'primary',
 '23088000',
 '1608618140']

In [16]:
print(f'{extract[1]} = {int(extract[9]):,}')

Manila = 23,088,000
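
Hard-coded positions like 1 and 9 are easy to get wrong. One option, sketched here on a three-column in-memory sample (values taken from the rows above), is to look the positions up from the header row.

```python
import csv
import io

# Small in-memory stand-in for worldcities.csv
sample = io.StringIO(
    '"city","country","population"\n'
    '"Tokyo","Japan","37977000"\n'
    '"Jakarta","Indonesia","34540000"\n'
)

reader = csv.reader(sample)
header = next(reader)                      # first row holds the column names
pop_idx = header.index("population")       # position of the population column
country_idx = header.index("country")      # position of the country column

rows = [(row[country_idx], int(row[pop_idx])) for row in reader]
print(rows)
```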


## Read Data - Dict Reader
* The Dict Reader offers a further improvement
* Note our header row has been consumed and used to give us a name-value dictionary for each row
* We don't need to remember column positions; we look up values by column name

In [17]:
lines = 0

with open('worldcities.csv', 'r') as file:
  reader = csv.DictReader(file, delimiter=',', quotechar='"')
  for line in reader:
    print(line)

    lines += 1
    if lines > 4:
      break

OrderedDict([('city', 'Tokyo'), ('city_ascii', 'Tokyo'), ('lat', '35.6897'), ('lng', '139.6922'), ('country', 'Japan'), ('iso2', 'JP'), ('iso3', 'JPN'), ('admin_name', 'Tōkyō'), ('capital', 'primary'), ('population', '37977000'), ('id', '1392685764')])
OrderedDict([('city', 'Jakarta'), ('city_ascii', 'Jakarta'), ('lat', '-6.2146'), ('lng', '106.8451'), ('country', 'Indonesia'), ('iso2', 'ID'), ('iso3', 'IDN'), ('admin_name', 'Jakarta'), ('capital', 'primary'), ('population', '34540000'), ('id', '1360771077')])
OrderedDict([('city', 'Delhi'), ('city_ascii', 'Delhi'), ('lat', '28.6600'), ('lng', '77.2300'), ('country', 'India'), ('iso2', 'IN'), ('iso3', 'IND'), ('admin_name', 'Delhi'), ('capital', 'admin'), ('population', '29617000'), ('id', '1356872604')])
OrderedDict([('city', 'Mumbai'), ('city_ascii', 'Mumbai'), ('lat', '18.9667'), ('lng', '72.8333'), ('country', 'India'), ('iso2', 'IN'), ('iso3', 'IND'), ('admin_name', 'Mahārāshtra'), ('capital', 'admin'), ('population', '23355000'),

In [18]:
extract = line
extract

OrderedDict([('city', 'Manila'),
             ('city_ascii', 'Manila'),
             ('lat', '14.5958'),
             ('lng', '120.9772'),
             ('country', 'Philippines'),
             ('iso2', 'PH'),
             ('iso3', 'PHL'),
             ('admin_name', 'Manila'),
             ('capital', 'primary'),
             ('population', '23088000'),
             ('id', '1608618140')])

In [19]:
print(f"{extract['city_ascii']} = {int(extract['population']):,}")

Manila = 23,088,000


## Sum Population
* Using a DictReader compute the total

In [20]:
# Read and sum using a DictReader
population = 0
with open('worldcities.csv', 'r') as file:
  reader = csv.DictReader(file, delimiter=',', quotechar='"')
  for line in reader:
    population += int(line['population'])
print(f'Population = {population:,}')

ValueError: ignored

## Sum Population (With Nulls)
* Looks like at least one row is missing its population

In [21]:
# Handle missing data with an if
population = 0
with open('worldcities.csv', 'r') as file:
  reader = csv.DictReader(file, delimiter=',', quotechar='"')
  for line in reader:
    population += 0 if line['population'] == '' else int(line['population'])
print(f'Population = {population:,}')

ValueError: ignored

## Sum Population (With Floats)
* And we hit another problem
* We're expecting an integer, but receive a float
* We need to parse float then int
* We get an answer

In [22]:
# Handle casting to float then to int
population = 0
with open('worldcities.csv', 'r') as file:
  reader = csv.DictReader(file, delimiter=',', quotechar='"')
  for line in reader:
    population += 0 if line['population'] == '' else int(float(line['population']))
print(f'Population = {population:,}')

Population = 4,155,400,545
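
The inline conditional gets hard to read once it handles both blanks and floats. A small helper keeps the loop tidy; `parse_population` is a hypothetical name, and this sketch mirrors the logic above, including the fact that `int()` truncates any fractional part.

```python
def parse_population(raw):
    """Convert a raw CSV field to an int: blank becomes 0, '1234.5' becomes 1234."""
    raw = raw.strip()
    if not raw:
        return 0
    # float() accepts both '37977000' and '1234.5'; int() then drops the fraction
    return int(float(raw))


values = ["37977000", "", "1234.5"]
total = sum(parse_population(v) for v in values)
print(f"Population = {total:,}")
```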


## Population by Country

In [23]:
country_population = defaultdict(int)

with open('worldcities.csv', 'r') as file:
  reader = csv.DictReader(file, delimiter=',', quotechar='"')
  for line in reader:
    country_population[line['country']] += 0 if line['population'] == '' else int(float(line['population']))

for key, value in country_population.items():
  print(f'{key} = {value:,}')

Japan = 148,273,773
Indonesia = 85,283,563
India = 270,170,371
Philippines = 62,547,674
China = 1,388,868,247
Brazil = 135,416,434
Korea, South = 58,770,399
Mexico = 103,464,676
Egypt = 38,873,354
United States = 400,521,452
Russia = 105,990,872
Thailand = 23,598,357
Argentina = 36,192,503
Bangladesh = 23,288,987
Nigeria = 54,771,880
Turkey = 70,958,511
Pakistan = 56,952,881
Iran = 48,493,926
Congo (Kinshasa) = 27,429,409
Vietnam = 37,934,404
France = 38,429,395
United Kingdom = 72,555,282
Peru = 21,273,383
Colombia = 28,491,631
Angola = 11,609,969
Malaysia = 19,509,922
Hong Kong = 7,347,000
Sudan = 15,752,339
Chile = 13,775,014
Saudi Arabia = 22,327,128
Tanzania = 11,276,325
Iraq = 18,624,715
Singapore = 5,745,000
Kenya = 11,135,325
Burma = 12,306,878
Canada = 37,747,786
Australia = 23,563,278
Côte D’Ivoire = 8,012,496
Spain = 22,741,191
South Africa = 15,430,094
Morocco = 16,986,585
Jordan = 6,057,680
Afghanistan = 7,474,525
Germany = 58,243,288
Algeria = 10,828,214
Bolivia = 9,579,0

## Most Populous Countries

In [24]:
# Chain array operations (pick top & reverse sort)
grp = sorted(country_population.items(), key=lambda kv: kv[1])[-5:][::-1]
for country, population in grp:
  print(f'{country} {population:,}')

China 1,388,868,247
United States 400,521,452
India 270,170,371
Japan 148,273,773
Brazil 135,416,434
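
Sorting everything and then slicing works, but the standard library can also pick the top entries directly. A sketch using `heapq.nlargest` on a handful of the totals printed above:

```python
import heapq

# A few of the country totals from the previous output
country_population = {
    "Japan": 148_273_773,
    "Indonesia": 85_283_563,
    "India": 270_170_371,
    "China": 1_388_868_247,
    "Brazil": 135_416_434,
    "United States": 400_521_452,
}

# nlargest returns the n biggest items, already in descending order
top5 = heapq.nlargest(5, country_population.items(), key=lambda kv: kv[1])
for country, population in top5:
    print(f"{country} {population:,}")
```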


## Notes
* Our pure Python approach meant
  * Wrestling with and reading files 
  * Deciding on a reader class 
  * Handling missing data & different types
  * Writing loops to read data & compute aggregates
* There are more productive ways to perform analysis
  * If you've done a lot of Python programming it's not SO bad, but slow
  * If you're just getting started it's a heavy lift to be productive

# Introducing Pandas
* Pandas is an open-source library
* Provides high-performance data manipulation
* The name derives from Panel Data
* Developed in 2008 - Wes McKinney

## DataFrames
* DataFrames are one of two primary objects in Pandas
* DataFrames are a 2D representation of data
  * Methods for I/O Reading and Writing 
  * Classes for structure & indexing (rows and columns)
  * Analytical methods and helpers built in
  * Transformational support
  * Visualization & plotting

## I/O Reading
* Pandas is capable of reading many different formats
* Concise - One line returns a nice table structure for the data
* Here we'll read from a CSV, but you can easily ingest JSON, Excel, Parquet, Pickle and other storage formats (ARFF is not built in and needs a helper such as scipy.io.arff)
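
To see what `read_csv` does with quoted numbers and blanks without touching the real file, here is a small sketch that reads from an in-memory buffer. The "Exampleville" row is made up to stand in for a city with a missing population.

```python
import io

import pandas as pd

# In-memory stand-in for the file; every field is quoted, as in worldcities.csv
sample = io.StringIO(
    '"city","country","population"\n'
    '"Tokyo","Japan","37977000"\n'
    '"Exampleville","Nowhere",""\n'   # hypothetical row with a blank population
)

df_sample = pd.read_csv(sample)

print(df_sample.dtypes)                    # population is inferred as float64
print(df_sample["population"].sum())       # the blank became NaN and is skipped by sum()
```

No type casting or blank handling was needed, which is the work the pure Python version had to do by hand.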

In [25]:
df = pd.read_csv('worldcities.csv')

In [26]:
# Top n rows (default 5)
df.head()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.6897,139.6922,Japan,JP,JPN,Tōkyō,primary,37977000.0,1392685764
1,Jakarta,Jakarta,-6.2146,106.8451,Indonesia,ID,IDN,Jakarta,primary,34540000.0,1360771077
2,Delhi,Delhi,28.66,77.23,India,IN,IND,Delhi,admin,29617000.0,1356872604
3,Mumbai,Mumbai,18.9667,72.8333,India,IN,IND,Mahārāshtra,admin,23355000.0,1356226629
4,Manila,Manila,14.5958,120.9772,Philippines,PH,PHL,Manila,primary,23088000.0,1608618140


In [27]:
# Bottom n rows (default 5)
df.tail()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
26564,Nord,Nord,81.7166,-17.8,Greenland,GL,GRL,Sermersooq,,10.0,1304217709
26565,Timmiarmiut,Timmiarmiut,62.5333,-42.2167,Greenland,GL,GRL,Kujalleq,,10.0,1304206491
26566,Cheremoshna,Cheremoshna,51.3894,30.0989,Ukraine,UA,UKR,Kyyivs’ka Oblast’,,0.0,1804043438
26567,Ambarchik,Ambarchik,69.651,162.3336,Russia,RU,RUS,Sakha (Yakutiya),,0.0,1643739159
26568,Nordvik,Nordvik,74.0165,111.51,Russia,RU,RUS,Krasnoyarskiy Kray,,0.0,1643587468


## Structure
* Pandas made some inferences on the type of our data
* It converted population from an object (string) into a float64
* The info method gives us details on the data
  * Position, Name, Non-Null & Type
  * Note: population has 973 nulls (25,596 non-null out of 26,569 rows)
* The shape attribute tells us the number of Rows and Columns

In [28]:
# Info tells us a lot of information about the structure of our dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26569 entries, 0 to 26568
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city        26569 non-null  object 
 1   city_ascii  26569 non-null  object 
 2   lat         26569 non-null  float64
 3   lng         26569 non-null  float64
 4   country     26569 non-null  object 
 5   iso2        26538 non-null  object 
 6   iso3        26569 non-null  object 
 7   admin_name  26493 non-null  object 
 8   capital     7626 non-null   object 
 9   population  25596 non-null  float64
 10  id          26569 non-null  int64  
dtypes: float64(3), int64(1), object(7)
memory usage: 2.2+ MB


In [29]:
# memory_usage='deep' inspects the object columns to report a more accurate memory footprint of the DataFrame
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26569 entries, 0 to 26568
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city        26569 non-null  object 
 1   city_ascii  26569 non-null  object 
 2   lat         26569 non-null  float64
 3   lng         26569 non-null  float64
 4   country     26569 non-null  object 
 5   iso2        26538 non-null  object 
 6   iso3        26569 non-null  object 
 7   admin_name  26493 non-null  object 
 8   capital     7626 non-null   object 
 9   population  25596 non-null  float64
 10  id          26569 non-null  int64  
dtypes: float64(3), int64(1), object(7)
memory usage: 11.9 MB


In [30]:
# Shape is important to know, it gives us the dimensions of the dataset
df.shape

(26569, 11)

In [31]:
# For numeric values, describe gives us a summary of the data and its distribution
df.describe()

Unnamed: 0,lat,lng,population,id
count,26569.0,26569.0,25596.0,26569.0
mean,33.095264,-11.36386,162345.7,1556097000.0
std,22.393678,73.946817,899658.5,287389000.0
min,-54.9341,-179.59,0.0,1004003000.0
25%,27.9183,-78.7794,9246.0,1276656000.0
50%,40.2188,-0.7689,20079.5,1643148000.0
75%,47.9878,29.6833,59369.25,1840005000.0
max,81.7166,179.3667,37977000.0,1934000000.0
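
The null counts that `info()` reports can also be computed directly. A sketch on a tiny hand-built frame (the "Exampleville" row is made up) showing `isna().sum()`, and why a column with missing values ends up as float64:

```python
import numpy as np
import pandas as pd

df_sample = pd.DataFrame(
    {
        "city": ["Tokyo", "Jakarta", "Exampleville"],   # last row is hypothetical
        "population": [37977000, 34540000, np.nan],
    }
)

# Number of missing values in each column
missing_per_column = df_sample.isna().sum()
print(missing_per_column)

# NaN is a float, so a numeric column containing it is stored as float64
print(df_sample["population"].dtype)

# Same count the long way: rows minus non-null entries
print(len(df_sample) - df_sample["population"].count())
```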


## Columns

In [32]:
# How many columns are there?
len(df.columns)

11

In [33]:
# What are the names of the columns?
df.columns

Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3',
       'admin_name', 'capital', 'population', 'id'],
      dtype='object')

In [34]:
# Subselect a frame
df[['city', 'country']]

Unnamed: 0,city,country
0,Tokyo,Japan
1,Jakarta,Indonesia
2,Delhi,India
3,Mumbai,India
4,Manila,Philippines
...,...,...
26564,Nord,Greenland
26565,Timmiarmiut,Greenland
26566,Cheremoshna,Ukraine
26567,Ambarchik,Russia


## Rows

In [35]:
# How many rows are there?
len(df)

26569

In [36]:
# What are the row labels / type?
df.index

RangeIndex(start=0, stop=26569, step=1)

In [37]:
# Selecting a subset of rows
df[1:5]

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
1,Jakarta,Jakarta,-6.2146,106.8451,Indonesia,ID,IDN,Jakarta,primary,34540000.0,1360771077
2,Delhi,Delhi,28.66,77.23,India,IN,IND,Delhi,admin,29617000.0,1356872604
3,Mumbai,Mumbai,18.9667,72.8333,India,IN,IND,Mahārāshtra,admin,23355000.0,1356226629
4,Manila,Manila,14.5958,120.9772,Philippines,PH,PHL,Manila,primary,23088000.0,1608618140


## Series
* Series are the second major object in Pandas
* Pandas uses Series to represent columns / rows
* A Series is a 1D labeled array of data with a single type (dtype)
  * object (e.g. strings)
  * scalar (e.g. int64, float64)
* Series have an index (label) and a value
* Easy interchange between dicts, lists and arrays
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In [38]:
# Turn a list into a Series
data = [3, 6, 9]
series = pd.Series(data)
series

0    3
1    6
2    9
dtype: int64

In [39]:
# Turn a list into a series with label names
data  = [3, 6, 9]
index = ['Item1', 'Item2', 'Item3']
series = pd.Series(data, index=index)
series

Item1    3
Item2    6
Item3    9
dtype: int64
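
The interchange works in the other direction too. A short sketch, reusing the same labels and values, that builds a Series from a dict and converts it back to a dict, a list and a NumPy array:

```python
import pandas as pd

# Dict keys become the index labels
series = pd.Series({"Item1": 3, "Item2": 6, "Item3": 9})

as_dict = series.to_dict()     # back to a plain dict
as_list = series.tolist()      # values only, as a Python list
as_array = series.to_numpy()   # values only, as a NumPy array

print(series["Item2"])         # look up a value by its label
print(as_dict)
print(as_list)
print(as_array)
```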

## Numpy
* The values in a Series are stored in a NumPy array
* Numpy is a library built for fast computations
  * Efficient linear algebra computations
  * Quick operations over dimensions of an array (sum, std, min, max)
* Your Pandas experience will leverage the features of Numpy

In [40]:
# Show examples of computations
my_array = np.array([1, 2, 3, 4, 5, 6])
print(f'Sum  = {my_array.sum()}')
print(f'Mean = {my_array.mean()}')
print(f'Max  = {my_array.max()}')

Sum  = 21
Mean = 3.5
Max  = 6
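
"Operations over dimensions" refers to the `axis` argument. A sketch that reshapes the same six numbers into a 2 x 3 array and aggregates along each axis:

```python
import numpy as np

# Two rows, three columns
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = grid.sum(axis=0)   # collapse the rows: one total per column
row_sums = grid.sum(axis=1)   # collapse the columns: one total per row

print(f'Column sums = {col_sums}')
print(f'Row sums    = {row_sums}')
print(f'Grand total = {grid.sum()}')
```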


# Pandas - Analytics

![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png)

## Read Data

In [41]:
# One liner to read
df = pd.read_csv('worldcities.csv')

## Sum Population
* We can refer to columns in two ways
* Column accessor (square brackets)
* Named attribute
  * Name can't have a space or invalid characters
  * Shorthand for column accessor

In [42]:
# Column Accessor
print(f"{df['population'].sum():,}")

4,155,400,546.559


In [43]:
# Named Attribute
print(f"{df.population.sum():,}")

4,155,400,546.559
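
The attribute shorthand has two catches: names with spaces can't be written as attributes, and names that collide with an existing DataFrame method return the method instead of the column. A sketch on a made-up frame (the column names "admin name" and "count" are chosen only to trigger both cases):

```python
import pandas as pd

df_sample = pd.DataFrame(
    {
        "population": [10, 20],
        "admin name": ["A", "B"],   # contains a space
        "count": [1, 2],            # same name as the DataFrame.count method
    }
)

print(df_sample.population.sum())        # attribute access works for a simple name
print(df_sample["admin name"].tolist())  # a name with a space needs brackets
print(callable(df_sample.count))         # True: this is the method, not the column
print(df_sample["count"].tolist())       # brackets always return the column
```

When in doubt, the bracket form is the safer choice.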


## Population by Country

In [44]:
# Groupby country, sum on population column
df.groupby('country')['population'].sum()

country
Afghanistan           7474525.0
Albania               1691769.0
Algeria              10828214.0
American Samoa          12576.0
Andorra                 22151.0
                        ...    
Wallis And Futuna           0.0
West Bank                   0.0
Yemen                 6674466.0
Zambia                4744736.0
Zimbabwe              3661602.0
Name: population, Length: 224, dtype: float64

## Most Populous Countries

In [45]:
# Group by country, sum on population, sort descending, top 5
grp = df.groupby('country')['population'].sum().sort_values(ascending=False)[:5]
for country, population in grp.items():
  print(f"{country} {int(population):,}")

China 1,388,868,247
United States 400,521,452
India 270,170,371
Japan 148,273,773
Brazil 135,416,434


## Notes
* Pandas is a game changer in terms of productivity
  * Easy to perform I/O
  * Basic analysis across rows / columns built in
  * Aggregates & groupings are part of the library
* The library is fluent (you can chain operations together, a little like R's dplyr)
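
Chaining reads well when each step goes on its own line inside parentheses. A sketch on four cities from the dataset, using `nlargest` as an alternative to sorting and slicing:

```python
import pandas as pd

df_sample = pd.DataFrame(
    {
        "country": ["India", "India", "Japan", "Indonesia"],
        "population": [29617000, 23355000, 37977000, 34540000],
    }
)

top = (
    df_sample
    .groupby("country")["population"]   # one group per country
    .sum()                               # total population per group
    .nlargest(2)                         # keep the two biggest totals, descending
)

for country, population in top.items():
    print(f"{country} {int(population):,}")
```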

# Assignment
* Read the file worldcities.csv 
* Produce summary information
  * How many cities are there?
  * How many countries are there?
* Produce summary statistics by country
  * Count of cities in each country
  * Min population in country
  * Max population in country
* Extra: There are three cities that have an incorrect population (Can you find them?)
* File is named 1_Assignment.ipynb

# Wrap Up
* Thanks for coming!
* Try the homework assignment!
* Hope to see you in two weeks!
  * Review Solution 
  * Python Fundamentals (DataFrames & Indexes & Slicing)

![](https://cms-assets.themuse.com/media/lead/_1200x630_crop_center-center_82_none/ba465624-967c-46b9-aafc-96d6beaae6ae.jpg?mtime=1570206105)