# Python Refresher

- Learning: students form small groups of 2 to 3 to run the scripts together.

- About Google Colab
  - How to use Google Colab?
  - https://www.youtube.com/watch?v=inN8seMm7UI
  - https://colab.research.google.com/#scrollTo=5fCEDCU_qrC0

- Section 1: Basic Python syntax
- Section 2: Probability distributions as DGP (Numpy)
- Section 3: Loading data (Pandas)
- Section 4: Viewing data (Numpy and Pandas)
- Section 5: Analyzing data (Numpy, Pandas, and Matplotlib)

- Data: https://drive.google.com/file/d/1LTxyDrSZD2f0LuELMenGVkWeag7KebHP/view?usp=sharing
- Data dictionary: https://docs.google.com/spreadsheets/d/1juSpTKO07HnzxAKRTg5MLLj1SWR60tX3/edit?usp=sharing&ouid=115425849068992643587&rtpof=true&sd=true

I cannot cover everything in each section, so please look up online tutorials for further details. A quick Google search is usually the fastest way to find an answer.

## 1. Basic Python Syntax

In [1]:
# print # anything after a hashtag is a comment
print("Hello world")

Hello world


### 1.1 Basic data types

In [2]:
# booleans
x = True # or False
type(x)

bool

In [3]:
# integers (discrete values)
x = 1
type(x)

int

In [4]:
# floating-point numbers (continuous values)
x = 100.5
type(x)

float

In [5]:
# strings
x = 'Gainesville'
type(x)

str

In [6]:
# list of items
x = [1, 2, 3, 'Gainesville']
type(x)

list

In [7]:
# sets
x = {1, 2, 3, 3, 2}
type(x)

set

In [8]:
# tuple
x = (12, "Urban Analytics")
type(x)

tuple

In [9]:
# dictionary
x = {"first name": "Shenhao", "last name": "Wang", "lab": "Urban AI"}
type(x)

dict

In [10]:
# keys and values in a dictionary
x = {"first name": "Shenhao", "last name": "Wang", "lab": "Urban AI"}
print(x.keys())
print(x.values())

dict_keys(['first name', 'last name', 'lab'])
dict_values(['Shenhao', 'Wang', 'Urban AI'])
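The container types above differ mainly in ordering, mutability, and how items are looked up. The following is a minimal sketch of those differences; the variable names and the city values are illustrative and not part of the original notebook.

```python
# Illustrative values only; all names here are hypothetical.
cities = ['Orlando', 'Gainesville', 'Miami']   # list: ordered and mutable
cities.append('Tampa')                          # lists can grow
first_city = cities[0]                          # indexing starts at 0

unique_values = {1, 2, 3, 3, 2}                 # set: duplicates are dropped
n_unique = len(unique_values)

course = (12, "Urban Analytics")                # tuple: ordered but immutable
course_name = course[1]

person = {"first name": "Shenhao", "last name": "Wang"}  # dict: key -> value
last_name = person["last name"]                 # look up a value by its key

print(first_city, n_unique, course_name, last_name)
```

A list is usually the right choice when order matters, a set when only membership matters, and a dictionary when values need labels.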


### 1.2 Basic algebraic operations

In [11]:
# adding integers
x = 1 + 2
print(type(x))
print(x)

<class 'int'>
3


In [12]:
# adding floats
x = 1.5 + 2.5
print(type(x))
print(x)

<class 'float'>
4.0


In [13]:
# multiplication
x = 3 * 4
print(x)

12


In [14]:
# division
x = 3/4
print(x)

0.75


In [15]:
# exponentiation (power)
x = 2 ** 3
print(x)

8


In [16]:
# order of basic algebraic operations
x = 2 * 3 ** 2
print(x)

18


### 1.3 Boolean operations

In [17]:
# check equality
x = 10 # assign value using =
x == 10 # check equality using ==

True

In [18]:
# check larger than
x = 10
x > 100

False

In [19]:
# True/False; and/or/not
x = True
y = False

In [20]:
x and y

False

In [21]:
x or y

True

In [22]:
not x

False

In [23]:
x and not y

True

### 1.4 String operations

In [24]:
x = "Hello" + "World"
print(x)

HelloWorld


In [25]:
x[0]

'H'

In [26]:
x[2]

'l'

In [27]:
# You cannot add string and float/int types
x + 1.0

TypeError: can only concatenate str (not "float") to str

In [28]:
# but you can multiply a string by an integer to repeat it
x * 2

'HelloWorldHelloWorld'
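One common way around the `TypeError` above is to convert the number to a string before concatenating. The sketch below shows that conversion together with an f-string and a slice; the variable names `y`, `z`, and `first_word` are illustrative.

```python
x = "Hello" + "World"

# Convert the number to a string first, then concatenate.
y = x + str(1.0)

# An f-string places values directly inside a string.
z = f"{x} has {len(x)} characters"

# Slicing returns a substring: positions 0 up to (but not including) 5.
first_word = x[0:5]

print(y)
print(z)
print(first_word)
```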

### 1.5 If conditions: if, elif, else

In [29]:
# basic syntax 1
condition = True
if condition:
  print("See me when the condition is true")

See me when the condition is true


In [30]:
# example
z = 4
if z - 2 == 2:
  print("z is four")

z is four


In [31]:
# basic syntax 2
condition = True
if condition:
  print("See me when the condition is true.")
else:
  print("See me when the condition is false.")

See me when the condition is true.


In [32]:
# example
z = 5
if z%2 == 0: # remainder is zero
  print("z is even")
else: # remainder is one
  print("z is odd")

z is odd


In [33]:
# basic syntax 3: if-elif-else
# example
z = 3
if z%2 == 0:
  print("z is divisible by 2") # False
elif z%3 == 0:
  print("z is divisible by 3") # True
else:
  print("z is neither divisible by 2 nor by 3")

z is divisible by 3


### 1.6 For Loop

In [34]:
# example 1
# print every character in the string
x = "Hello World"
for letter in x:
  print(letter)

H
e
l
l
o
 
W
o
r
l
d


In [35]:
# example 2
# adding values
initial_value = 0
for i in range(4): # range(4): 0, 1, 2, 3
  initial_value = initial_value + i
  print(initial_value)

0
1
3
6


In [36]:
# example 3
# print the list
my_list = ['Orlando', 'Gainesville', 'Miami', 'Tampa Bay', 'Jacksonville']
for item in my_list:
  print(item)

Orlando
Gainesville
Miami
Tampa Bay
Jacksonville


### 1.7 Functions

In [37]:
# example 1.
def add_one(x):
  '''
  Add one to the value x
  '''
  y = x + 1
  return y

# test
x = 5
y = add_one(x)
print(y)

6
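Functions become more useful once they combine the loops and conditions from the previous subsections. As a small sketch, the hypothetical function `count_even` below counts the even numbers in a list.

```python
def count_even(values):
  '''
  Count how many numbers in values are even
  '''
  count = 0
  for v in values:
    if v % 2 == 0: # remainder is zero
      count = count + 1
  return count

# test
n_even = count_even([1, 2, 3, 4, 6])
print(n_even)
```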


## 2. Probability distributions as DGP

In [38]:
# import numpy
import numpy as np



In [39]:
# draw 10 samples from a Bernoulli distribution (a binomial with n = 1), p = 0.6
np.random.binomial(size = 10, n = 1, p = 0.6)

array([1, 0, 1, 1, 0, 0, 0, 0, 0, 1])

In [40]:
# draw 10 samples from a binomial distribution with n = 5, p = 0.6
np.random.binomial(size = 10, n = 5, p = 0.6)

array([3, 4, 3, 3, 4, 2, 3, 3, 3, 3])

In [41]:
# draw 10 samples from the standard normal distribution (mu = 0, sigma = 1)
np.random.normal(0, 1, 10)

array([-0.19207072,  1.17084232, -0.04219169, -0.94849387, -1.54247811,
        1.99373111, -1.7452171 , -2.2519258 ,  0.54642146,  0.72609794])

In [42]:
# simulate an age vector from a normal distribution
# Let's choose a mean and a standard deviation to simulate the age vector.
v = np.random.normal(44, 10, 20) # mu = 44, sigma = 10, size = 20
v = np.round(v, 1) # np.round_ was removed in NumPy 2.0; np.round works in all versions
v

array([38.1, 37.1, 50.5, 52.9, 39.9, 37. , 40. , 20.7, 33.7, 36.5, 58.8,
       47.2, 46.3, 38.7, 59.4, 44.5, 36.8, 55.8, 55.6, 23.3])
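The draws above change on every run. A seeded generator makes them reproducible, and with a large sample the sample statistics should land close to the parameters of the data generating process. The sketch below assumes NumPy's `default_rng` interface; the seed value and sample size are arbitrary choices.

```python
import numpy as np

# A seeded generator makes the draws reproducible (the seed value is arbitrary).
rng = np.random.default_rng(seed=0)

# Bernoulli draws: binomial with n = 1 and p = 0.6
draws = rng.binomial(n=1, p=0.6, size=100_000)
print(draws.mean())              # should be close to p = 0.6

# Normal draws with mu = 44 and sigma = 10 (sigma is the standard deviation)
ages = rng.normal(loc=44, scale=10, size=100_000)
print(ages.mean(), ages.std())   # should be close to 44 and 10
```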

## 3. Loading data

- Please create a folder named data under your Colab Notebooks folder, then copy the data set (Florida_ct.csv) and the documentation into it. Adjust the file path passed to pd.read_csv to match the folder where you saved the file.

In [48]:
# import pandas
import pandas as pd

In [49]:
# load the csv file using pandas
df = pd.read_csv('SampleDataset/Florida_ct.csv', index_col = 0)

In [None]:
# check the dataframe
df

Unnamed: 0,pop_total,sex_total,sex_male,sex_female,age_median,households,race_total,race_white,race_black,race_native,...,travel_walk_ratio,travel_work_home_ratio,edu_bachelor_ratio,edu_master_ratio,edu_phd_ratio,edu_higher_edu_ratio,employment_unemployed_ratio,vehicle_per_capita,vehicle_per_household,vacancy_ratio
0,2812.0,2812.0,1383.0,1429.0,39.4,931.0,2812.0,2086.0,517.0,0.0,...,0.014815,0.024242,0.183838,0.029798,0.003030,0.216667,0.286635,0.528094,1.595059,0.155938
1,4709.0,4709.0,2272.0,2437.0,34.2,1668.0,4709.0,2382.0,1953.0,0.0,...,0.022150,0.004615,0.135222,0.040245,0.003220,0.178686,0.318327,0.460183,1.299161,0.152869
2,5005.0,5005.0,2444.0,2561.0,34.1,1379.0,5005.0,2334.0,2206.0,224.0,...,0.026141,0.027913,0.213247,0.064620,0.007431,0.285299,0.366755,0.450949,1.636693,0.162211
3,6754.0,6754.0,2934.0,3820.0,31.3,2238.0,6754.0,4052.0,1671.0,326.0,...,0.052697,0.004054,0.093379,0.082510,0.012599,0.188488,0.314452,0.474830,1.432976,0.178716
4,3021.0,3021.0,1695.0,1326.0,44.1,1364.0,3021.0,2861.0,121.0,0.0,...,0.003014,0.013059,0.219868,0.138631,0.007064,0.365563,0.218447,0.659053,1.459677,0.335930
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4162,15742.0,15742.0,7957.0,7785.0,41.0,5517.0,15742.0,13894.0,1128.0,64.0,...,0.000000,0.062212,0.164241,0.084891,0.002066,0.251197,0.401551,0.426820,1.217872,0.099118
4163,5723.0,5723.0,2914.0,2809.0,43.0,2001.0,5723.0,4664.0,482.0,0.0,...,0.017050,0.047581,0.215161,0.084563,0.007005,0.306730,0.411036,0.440678,1.260370,0.039827
4164,10342.0,10342.0,4657.0,5685.0,37.6,3746.0,10342.0,7956.0,1351.0,13.0,...,0.000000,0.038862,0.137002,0.030591,0.002049,0.169643,0.353295,0.482692,1.332621,0.041208
4165,8960.0,8960.0,4166.0,4794.0,37.2,3324.0,8960.0,6286.0,1831.0,0.0,...,0.024021,0.064132,0.174399,0.063014,0.003126,0.240540,0.363482,0.478571,1.290012,0.017440

This data set comes from the US Census. Each row represents a census tract, and each column represents a variable. We can check the columns one by one using numpy and pandas. Further details of the variables can be found in the documentation file.
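To see what `index_col = 0` does without needing the full file, the sketch below reads a two-row excerpt from an in-memory string instead of from disk. The `csv_text` string and the name `small_df` are illustrative; only two of the columns are included.

```python
import io
import pandas as pd

# A tiny excerpt in CSV form: the first (unnamed) column holds the row labels.
csv_text = """,pop_total,households
0,2812.0,931.0
1,4709.0,1668.0
"""

# index_col = 0 tells pandas to use the first column as the row index
small_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(small_df)
print(small_df.shape)   # (number of rows, number of columns)
```

Without `index_col = 0`, the first column would be loaded as an ordinary data column and pandas would add its own row index.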

## 4. Viewing data

- Numpy and pandas are the two most common Python packages to process data and describe the summary statistics for data sets.

### 4.1 Data types in numpy and pandas

Before delving into the data set, let's get familiar with the data structures in numpy and pandas.

In [56]:
# import numpy package
# Note: numpy for matrix operation; pandas for data sets.
import numpy as np
import pandas as pd

In [57]:
# Numpy array (1D)
array_1 = np.arange(0,5)
array_1

In [58]:
# Numpy array (2D)
# 1. Two print statements: the 1D array first, then the 2D array.
# 2. random.randint - randomly generating 20 integers from 1 to 9 (the upper bound 10 is excluded).
# 3. reshape: reshaping the array to a matrix with 5*4 shape.
array_1D = np.random.randint(1,10,20)
print(array_1D)
array_2D = array_1D.reshape(5, -1)
print(array_2D)

In [None]:
# Pandas series
my_list = [8, 5, 77, 2]
my_series = pd.Series(my_list)
my_series

0     8
1     5
2    77
3     2
dtype: int64

In [59]:
# Pandas dataframe
# a dict can contain multiple lists and label them
my_dict = {'hh_income'  : [71000, 28000, 35000, 120000],
           'home_value' : [520000, 275000, 400000, 980000]}
my_dict

{'hh_income': [71000, 28000, 35000, 120000],
 'home_value': [520000, 275000, 400000, 980000]}

In [60]:
# Turn a dictionary to dataframe.
# dataframe is the most common data structure in data analysis in python.
df_s = pd.DataFrame(my_dict)
df_s

Unnamed: 0,hh_income,home_value
0,71000,520000
1,28000,275000
2,35000,400000
3,120000,980000


### 4.2 Viewing the loaded Data

In [62]:
# shape
df.shape

(4167, 88)

In [None]:
# view the head
df.head()

Unnamed: 0,pop_total,sex_total,sex_male,sex_female,age_median,households,race_total,race_white,race_black,race_native,...,travel_walk_ratio,travel_work_home_ratio,edu_bachelor_ratio,edu_master_ratio,edu_phd_ratio,edu_higher_edu_ratio,employment_unemployed_ratio,vehicle_per_capita,vehicle_per_household,vacancy_ratio
0,2812.0,2812.0,1383.0,1429.0,39.4,931.0,2812.0,2086.0,517.0,0.0,...,0.014815,0.024242,0.183838,0.029798,0.00303,0.216667,0.286635,0.528094,1.595059,0.155938
1,4709.0,4709.0,2272.0,2437.0,34.2,1668.0,4709.0,2382.0,1953.0,0.0,...,0.02215,0.004615,0.135222,0.040245,0.00322,0.178686,0.318327,0.460183,1.299161,0.152869
2,5005.0,5005.0,2444.0,2561.0,34.1,1379.0,5005.0,2334.0,2206.0,224.0,...,0.026141,0.027913,0.213247,0.06462,0.007431,0.285299,0.366755,0.450949,1.636693,0.162211
3,6754.0,6754.0,2934.0,3820.0,31.3,2238.0,6754.0,4052.0,1671.0,326.0,...,0.052697,0.004054,0.093379,0.08251,0.012599,0.188488,0.314452,0.47483,1.432976,0.178716
4,3021.0,3021.0,1695.0,1326.0,44.1,1364.0,3021.0,2861.0,121.0,0.0,...,0.003014,0.013059,0.219868,0.138631,0.007064,0.365563,0.218447,0.659053,1.459677,0.33593


In [63]:
# view the tail
df.tail()

In [64]:
# check the column names
df.columns

Index(['pop_total', 'sex_total', 'sex_male', 'sex_female', 'age_median',
       'households', 'race_total', 'race_white', 'race_black', 'race_native',
       'race_asian', 'inc_total_pop', 'inc_no_pop', 'inc_with_pop',
       'inc_pop_10k', 'inc_pop_1k_15k', 'inc_pop_15k_25k', 'inc_pop_25k_35k',
       'inc_pop_35k_50k', 'inc_pop_50k_65k', 'inc_pop_65k_75k', 'inc_pop_75k',
       'inc_median_ind', 'travel_total_to_work', 'travel_driving_to_work',
       'travel_pt_to_work', 'travel_taxi_to_work', 'travel_cycle_to_work',
       'travel_walk_to_work', 'travel_work_from_home', 'edu_total_pop',
       'bachelor_male_25_34', 'master_phd_male_25_34', 'bachelor_male_35_44',
       'master_phd_male_35_44', 'bachelor_male_45_64', 'master_phd_male_45_64',
       'bachelor_male_65_over', 'master_phd_male_65_over',
       'bachelor_female_25_34', 'master_phd_female_25_34',
       'bachelor_female_35_44', 'master_phd_female_35_44',
       'bachelor_female_45_64', 'master_phd_female_45_64',
       '

### 4.3 Selecting columns

In [66]:
# choose the census tract FIPS column.
# Q: What is FIPS? Federal Information Processing Standards.
# Q: What is a census tract? https://en.wikipedia.org/wiki/Census_tract
df['full_ct_fips']

0       12086000211
1       12086000212
2       12086000213
3       12086000214
4       12086000128
           ...     
4162    12019031200
4163    12019030801
4164    12019030902
4165    12019030301
4166    12019031400
Name: full_ct_fips, Length: 4167, dtype: int64

In [None]:
# choose a column by the column name
df['pop_total']

0        2812.0
1        4709.0
2        5005.0
3        6754.0
4        3021.0
         ...   
4162    15742.0
4163     5723.0
4164    10342.0
4165     8960.0
4166     5083.0
Name: pop_total, Length: 4167, dtype: float64

In [None]:
# choose multiple columns by a list of column names
df[['pop_total', 'households']]

In [67]:
# create a new column by the operation between two columns
df['average_household_size'] = df['pop_total']/df['households']
df['average_household_size']

0       3.020408
1       2.823141
2       3.629442
3       3.017873
4       2.214809
          ...   
4162    2.853362
4163    2.860070
4164    2.760812
4165    2.695548
4166    2.896296
Name: average_household_size, Length: 4167, dtype: float64

## **Exercise** Choose the column showing the ratio of PhD graduates. Please find the column name from the documentation file. What are the values there?
## **Exercise** Choose the column showing the ratio of driving. Please find the column name from the documentation file. What are the values there?

### 4.4 Selecting Rows by Labels

In [None]:
# use .loc to select by row label
# returns the row as a series whose index is the dataframe column names
df.loc[0]

pop_total                        2812.0
sex_total                        2812.0
sex_male                         1383.0
sex_female                       1429.0
age_median                         39.4
                                 ...   
employment_unemployed_ratio    0.286635
vehicle_per_capita             0.528094
vehicle_per_household          1.595059
vacancy_ratio                  0.155938
average_household_size         3.020408
Name: 0, Length: 89, dtype: object

In [70]:
# use .loc to select single value by row label, column name
df.loc[0, 'pop_total']

In [71]:
# slice of rows from label 2 to label 4, inclusive
# this returns a pandas dataframe
df.loc[2:4]

### 4.5 Selecting using value criteria

In [None]:
# filter the dataframe by census tracts with more than 3000 residents
df[df['pop_total'] > 3000]

Unnamed: 0,pop_total,sex_total,sex_male,sex_female,age_median,households,race_total,race_white,race_black,race_native,...,travel_work_home_ratio,edu_bachelor_ratio,edu_master_ratio,edu_phd_ratio,edu_higher_edu_ratio,employment_unemployed_ratio,vehicle_per_capita,vehicle_per_household,vacancy_ratio,average_household_size
1,4709.0,4709.0,2272.0,2437.0,34.2,1668.0,4709.0,2382.0,1953.0,0.0,...,0.004615,0.135222,0.040245,0.003220,0.178686,0.318327,0.460183,1.299161,0.152869,2.823141
2,5005.0,5005.0,2444.0,2561.0,34.1,1379.0,5005.0,2334.0,2206.0,224.0,...,0.027913,0.213247,0.064620,0.007431,0.285299,0.366755,0.450949,1.636693,0.162211,3.629442
3,6754.0,6754.0,2934.0,3820.0,31.3,2238.0,6754.0,4052.0,1671.0,326.0,...,0.004054,0.093379,0.082510,0.012599,0.188488,0.314452,0.474830,1.432976,0.178716,3.017873
4,3021.0,3021.0,1695.0,1326.0,44.1,1364.0,3021.0,2861.0,121.0,0.0,...,0.013059,0.219868,0.138631,0.007064,0.365563,0.218447,0.659053,1.459677,0.335930,2.214809
7,4215.0,4215.0,2194.0,2021.0,38.8,1032.0,4215.0,3327.0,642.0,106.0,...,0.023061,0.116590,0.028531,0.008101,0.153223,0.310919,0.452669,1.848837,0.110345,4.084302
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4162,15742.0,15742.0,7957.0,7785.0,41.0,5517.0,15742.0,13894.0,1128.0,64.0,...,0.062212,0.164241,0.084891,0.002066,0.251197,0.401551,0.426820,1.217872,0.099118,2.853362
4163,5723.0,5723.0,2914.0,2809.0,43.0,2001.0,5723.0,4664.0,482.0,0.0,...,0.047581,0.215161,0.084563,0.007005,0.306730,0.411036,0.440678,1.260370,0.039827,2.860070
4164,10342.0,10342.0,4657.0,5685.0,37.6,3746.0,10342.0,7956.0,1351.0,13.0,...,0.038862,0.137002,0.030591,0.002049,0.169643,0.353295,0.482692,1.332621,0.041208,2.760812
4165,8960.0,8960.0,4166.0,4794.0,37.2,3324.0,8960.0,6286.0,1831.0,0.0,...,0.064132,0.174399,0.063014,0.003126,0.240540,0.363482,0.478571,1.290012,0.017440,2.695548

In [None]:
# In fact, there are two steps.
# step 1. create a mask
mask = df['pop_total'] > 3000
mask

0       False
1        True
2        True
3        True
4        True
        ...  
4162     True
4163     True
4164     True
4165     True
4166     True
Name: pop_total, Length: 4167, dtype: bool

In [73]:
# step 2. Choose the rows
df[mask]
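Row selection by label and by mask can be tried on a small table first. The sketch below uses a hypothetical toy dataframe `toy` with non-default row labels to contrast `.loc` (label-based) with `.iloc` (position-based), and to combine two conditions in one mask; the thresholds are arbitrary.

```python
import pandas as pd

# Hypothetical toy data with non-default row labels
toy = pd.DataFrame({'pop_total':  [2812.0, 4709.0, 5005.0, 6754.0],
                    'households': [931.0, 1668.0, 1379.0, 2238.0]},
                   index=[10, 11, 12, 13])

by_label = toy.loc[10:11]      # label-based: both end points are included
by_position = toy.iloc[0:2]    # position-based: the end point is excluded

# combine two conditions with & (and) or | (or); each condition needs parentheses
mask = (toy['pop_total'] > 3000) & (toy['households'] < 2000)
selected = toy[mask]
print(selected)
```

With the default integer index the labels and positions coincide, which is why the difference between `.loc` and `.iloc` is easy to miss on the census data.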

## 5. Analyzing data

### 5.1 Finding nominal, ordinal, and cardinal numbers in the data set.

In [76]:
# the majority of the variables in THIS data set are cardinal numbers.
# total population: cardinal numbers.
df['pop_total'].values

array([ 2812.,  4709.,  5005., ..., 10342.,  8960.,  5083.])

In [77]:
# median household income: cardinal numbers.
df['inc_median_household'].values

In [77]:
# average vehicle per household: cardinal numbers.
df['vehicle_per_household'].values

In [None]:
# ordinal example
# turn the cardinal to ordinal numbers.
df['high_low_inc'] = df['inc_median_household'] > 50000 # Separate high vs. low income
df['high_low_inc'].astype('int') # change the type to integer: 1 - high income; 0 - low income

0       1
1       0
2       0
3       0
4       1
       ..
4162    1
4163    1
4164    1
4165    1
4166    0
Name: high_low_inc, Length: 4167, dtype: int64

In [78]:
# county FIPS: nominal numbers
df['county_fips']

0       86
1       86
2       86
3       86
4       86
        ..
4162    19
4163    19
4164    19
4165    19
4166    19
Name: county_fips, Length: 4167, dtype: int64
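The high/low income split uses a single cutoff. When more than two ordered groups are wanted, `pd.cut` is one option. The sketch below applies it to the four toy `hh_income` values used earlier in this notebook; the cutoffs of 40000 and 80000 and the group labels are hypothetical choices for illustration.

```python
import pandas as pd

# Toy income values; the bin edges below are hypothetical cutoffs
income = pd.Series([71000, 28000, 35000, 120000])
bins = [0, 40000, 80000, float('inf')]
labels = ['low', 'middle', 'high']

# pd.cut assigns each value to a bin and returns an ordered categorical
income_group = pd.cut(income, bins=bins, labels=labels)
print(income_group)

# the integer codes keep the ordering: 0 - low; 1 - middle; 2 - high
income_code = income_group.cat.codes.tolist()
print(income_code)
```

The codes are ordinal: their order is meaningful, but the distance between 0 and 1 does not have to equal the distance between 1 and 2.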

### 5.2 Summarizing central tendency (one variable)

In [81]:
# summarizing the mean and median: population
print(df['pop_total'].mean())
print(df['pop_total'].median())

5015.991360691145
4519.0

In [82]:
# summarizing the mean and median: income
print(df['inc_median_household'].mean())
print(df['inc_median_household'].median())

59617.548644108465
54140.0

In [None]:
# summarizing the mean and median: property value
print(df['property_value_median'].mean())
print(df['property_value_median'].median())

In [None]:
# summarizing the mean and median: vehicles per household
print(df['vehicle_per_household'].mean())
print(df['vehicle_per_household'].median())

1.8593297172274632
1.21645736946464

In [83]:
# Q: Can we summarize the mean and median of county FIPS? A: No. You can get a number but it has no meaning.
print(df['county_fips'].mean())
print(df['county_fips'].median())

71.62634989200863
86.0

### 5.3 Summarizing variability (one variable)

Under the normal distribution, the empirical rule (the 68-95-99.7 rule) tells you where your values lie: around 68% of values are within 1 standard deviation of the mean, around 95% are within 2 standard deviations, and around 99.7% are within 3 standard deviations.

In [None]:
# summarizing the variance and std: population
print(df['pop_total'].var())
print(df['pop_total'].std())

8909397.168912383
2984.8613316052697

In [None]:
# summarizing the variance and std: income
print(df['inc_median_household'].var())
print(df['inc_median_household'].std())

In [86]:
# summarizing the variance and std: vehicles per household
print(df['vehicle_per_household'].var())
print(df['vehicle_per_household'].std())
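The empirical rule stated at the start of this subsection can be checked numerically. The sketch below draws simulated standard normal data (not the census data) and computes the share of values within 1, 2, and 3 standard deviations of the mean; the seed and sample size are arbitrary.

```python
import numpy as np

# Simulated data from a standard normal distribution (seed is arbitrary)
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=0, scale=1, size=100_000)

mu = x.mean()
sigma = x.std()

# share of values within k standard deviations of the mean
shares = []
for k in [1, 2, 3]:
  share = np.mean(np.abs(x - mu) <= k * sigma)
  shares.append(share)
  print(k, round(share, 3))
```

The shares should be close to 0.68, 0.95, and 0.997. Note that pandas `.var()` and `.std()` divide by n - 1 by default, whereas numpy `.var()` and `.std()` divide by n, so the two libraries give slightly different numbers on the same data.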

## **Exercise.** Choose the column showing the ratio of driving. Please find the column name from the documentation file. Please compute the mean, median, variance, and standard deviation.  


### 5.4 Visualizing one variable

In [None]:
# matplotlib
# matplotlib is the most common visualization tool in python.
import matplotlib.pyplot as plt

In [None]:
# use histogram - the first trial.
var = 'pop_total'
n, bins, graph = plt.hist(x=df[var], bins=20)

In [None]:
# histograms - the second trial with more details.
var = 'pop_total'
n, bins, graph = plt.hist(x=df[var], bins=20)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Population')
plt.ylabel('Frequency')
plt.title('Population of the Census tracts in Florida')
plt.text(30000, 800, r'$\mu=5015, \sigma=2984$')
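The values of mu and sigma in the plot label above are typed in by hand. A possible alternative is to compute them from the data and format them into the label. The sketch below also shows `np.histogram`, which returns the same kind of counts and bin edges as `plt.hist` without drawing anything. The array `values` is simulated stand-in data, not the census column.

```python
import numpy as np

# Stand-in data; in the notebook this would be df['pop_total']
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=5000, scale=3000, size=1000)

# counts per bin and the 21 edges that delimit the 20 bins
counts, bin_edges = np.histogram(values, bins=20)
print(counts)

# build the label from the computed mean and standard deviation
label = r'$\mu={:.0f}, \sigma={:.0f}$'.format(values.mean(), values.std())
print(label)
```

The resulting `label` string can be passed to `plt.text` in place of the hard-coded text.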

## **Exercise.** Choose the column showing the ratio of driving. Please plot the histogram, and compute the mean and standard deviation.

## **Final Note**: Please take the Practicum AI at UFL to learn more about python coding for AI.
## **Link**: https://calendar.hr.ufl.edu/event/practicum-ai/all/
