# Module 1--The basics of Python


- Module 1 covers Python basics
    - Notes (Current page)
    - YouTube video

## I.1 Running Python


### I.1.1 Google's Colab  (for beginners)
- My recommendation for those new to Python
- See Section I.2 for how to launch Colab for free from your Google Drive

### I.1.2 Anaconda (for more experienced users)
- Anaconda installs Python and Jupyter Notebook (you are currently viewing a Jupyter notebook)
- [Link](https://www.anaconda.com/products/individual)

### I.1.3 UV (for advanced users)
- UV installs Python and lets you manage Python versions and package dependencies
- [Link](v/getting-started/installation/)

### I.1.4 Binder (for beginners)
- Use Binder by clicking the Binder button on our course homepage: https://github.com/Data-Science-Public-Policy/graspp_2025_spring/tree/main

<img src="Screenshots/binder.png" />


## I.2 How to run Google Colab
### I.2.1 How to launch
<img src="Screenshots/colab.png" style="width: 500px; height: 500px;" />

### I.2.2 How to run Colab

- Run code:
    - Press the play button next to the code cell
    - Shift + Return
        - Keyboard shortcut on Macs
        
- Reset notebook
    - Runtime -> Restart runtime
    
- Code versus text cells
    - You cannot run code in a text cell!

# 1. Importing Data
- All data and code for this course, including this notebook, are available on our course page:
    - https://github.com/Data-Science-Public-Policy/graspp_2025_spring/tree/main

## 1.1. Importing data from our class page/the internet

### 1.1.A. Saving information in this notebook

In [1]:
# This line saves the web address (URL) of the World Bank data file as text in a variable named 'url'.
url = "https://github.com/Data-Science-Public-Policy/graspp_2025_spring/raw/refs/heads/module_1/data/examples/module_1/world_bank_data.csv"

In [2]:
# This line displays the value stored in 'url'
url

'https://github.com/Data-Science-Public-Policy/graspp_2025_spring/raw/refs/heads/module_1/data/examples/module_1/world_bank_data.csv'

### 1.1.1 Importing the data

In [3]:
# These lines import the pandas library and use its 'read_csv' function to read the file at 'url' into a DataFrame named 'df'.
# A DataFrame is essentially a Python version of an Excel spreadsheet: a two-dimensional table with rows and columns.
import pandas as pd
df = pd.read_csv(url)

# The 'head(2)' returns the first 2 rows of the DataFrame.
df.head(2)

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
0,0,GDP per capita (current US$),Canada,CAN,2023,53431.185706
1,1,GDP per capita (current US$),Canada,CAN,2022,55509.393176


In [4]:
# This line does the exact same thing as the previous line, but it directly uses the URL string, instead of the variable 'url'.
# It reads the World Bank data from the given URL and stores it as a pandas DataFrame named 'df'.
df = pd.read_csv("https://github.com/Data-Science-Public-Policy/graspp_2025_spring/raw/refs/heads/module_1/data/examples/module_1/world_bank_data.csv")

# The 'head(3)' returns the first 3 rows of the DataFrame.
df.head(3)

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
0,0,GDP per capita (current US$),Canada,CAN,2023,53431.185706
1,1,GDP per capita (current US$),Canada,CAN,2022,55509.393176
2,2,GDP per capita (current US$),Canada,CAN,2021,52496.844169
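
The `Unnamed: 0` columns in the output above typically appear when a CSV file was saved together with its row index and then read back in. The following is a minimal sketch of that behaviour, using a small made-up CSV held in memory (the values are illustrative, not the course data). It shows how `index_col=0` tells `read_csv` to treat the first column as the row index instead.

```python
import io
import pandas as pd

# Hypothetical CSV text: the first column has no name because it is a saved row index
csv_text = """,country,date,value
0,Canada,2023,1.5
1,Canada,2022,1.2
"""

# Default behaviour: the unnamed first column becomes a regular column named 'Unnamed: 0'
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())

# With index_col=0 the first column is used as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())
```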


## 1.2 Importing tables directly from a web page

In [5]:
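url = 'https://en.wikipedia.org/wiki/List_of_largest_cities'
# pd.read_html returns a list of DataFrames, one for each table found on the page
df = pd.read_html(url)
# Select the second table in the list (index 1) and show its first 5 rows
df[1].head(5)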

Unnamed: 0_level_0,City[a],Country,UN 2018 population estimates[b],City proper[c],City proper[c],City proper[c],City proper[c],Urban area[12],Urban area[12],Urban area[12],Metropolitan area[d],Metropolitan area[d],Metropolitan area[d]
Unnamed: 0_level_1,City[a],Country,UN 2018 population estimates[b],Definition,Population,Area (km2),Density (/km2),Population,Area (km2),Density (/km2),Population,Area (km2),Density (/km2)
0,Tokyo,Japan,37468000,Metropolis prefecture,13515271,2191,"6,169 [13]",37785000,8231,"4,591 [e]",37274000,13452,"2,771 [14]"
1,Delhi,India,28514000,Municipal corporation,16753235,1484,"11,289 [15]",32226000,2344,"13,748 [f]",29000000,3483,"8,326 [16]"
2,Shanghai,China,25582000,Municipality,24870895,6341,"3,922 [17][18]",24073000,4333,"5,556 [g]",—,—,—
3,São Paulo,Brazil,21650000,Municipality,12252023,1521,"8,055 [19]",23086000,3649,"6,327 [h]",21734682,7947,"2,735 [20]"
4,Mexico City,Mexico,21581000,City-state,9209944,1485,"6,202 [21]",21804000,2530,8618,21804515,7866,"2,772 [22]"
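
The table above has a two-level header (for example `City proper` over `Population`), which pandas stores as a `MultiIndex`. Below is a sketch of one possible way to flatten such a header into single column names. The small table and its values are made up for illustration, and the `" - "` separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical table with a two-level header, similar in shape to the Wikipedia table
two_level_columns = pd.MultiIndex.from_tuples([
    ("City", "City"),
    ("City proper", "Population"),
    ("Urban area", "Population"),
])
cities = pd.DataFrame([["A", 100, 150], ["B", 200, 260]], columns=two_level_columns)

# Join the two header levels; keep a single name when both levels are identical
cities.columns = [
    top if top == bottom else f"{top} - {bottom}"
    for top, bottom in cities.columns
]
print(cities.columns.tolist())
```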


## 1.3 Importing data via an API

### 1.3.A Introducing functions

In [6]:
# We already used functions!

# function to read csvs from the pandas library
# pd.read_csv()
# function to read html from the pandas library
# pd.read_html()

#### 1.3.A.1 Basic function

In [7]:
# Now we make our own
def num_print(a):
    return a

In [8]:
num_print(8)

8

In [9]:
num_print(7)

7

#### 1.3.A.2 Multiple arguments

In [10]:
def add_func(a,b,c):
    print('This function adds')
    final = a+b+c
    return final

In [11]:
add_func(10,20,10)

This function adds


40
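
Arguments can be passed by position, as above, or by name. The sketch below uses a hypothetical variant, `add_func_default` (not part of the notebook), to show keyword arguments and a default value, which is the same calling style used for `download_worldbank` later on.

```python
def add_func_default(a, b, c=0):
    """Add up to three numbers; c is optional and defaults to 0."""
    return a + b + c

# Arguments matched by position
total_positional = add_func_default(10, 20, 10)

# Arguments matched by name; c is left out, so its default (0) is used
total_keyword = add_func_default(a=10, b=20)

print(total_positional, total_keyword)
```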

### 1.3.1 Function to download data from the World Bank WDI

In [12]:
import requests
import pandas as pd

def download_worldbank(indicator, countries, date_start, date_end):
    url_base = 'http://api.worldbank.org/v2/'  # Base URL for the World Bank API
    country_codes = ';'.join(countries)  # Combine country codes into a string
    url = url_base + f'country/{country_codes}/indicator/{indicator}?date={date_start}:{date_end}&per_page=30000'  # Build the URL with the start and end dates
    url = url_base + f'country/{country_codes}/indicator/{indicator}?per_page=30000'  # NOTE: this line overwrites the URL above, so date_start and date_end are ignored (which is why years outside the requested range can appear in the output); delete this line to apply the date filter

    response = requests.get(url)  # Download data from the URL
    df = pd.read_xml(response.content)  # Convert the downloaded data to a table
    return df  # Return the table

In [13]:
# Example 1
data = download_worldbank(
    indicator = 'NY.GDP.PCAP.CD' , 
    countries = ['US', 'CA', 'MX', 'JP'],  
    date_start = '2021', 
    date_end = '2023'
)
data.head(2)

Unnamed: 0,indicator,country,countryiso3code,date,value,unit,obs_status,decimal
0,GDP per capita (current US$),Canada,CAN,2023,53431.185706,,,1
1,GDP per capita (current US$),Canada,CAN,2022,55509.393176,,,1


In [14]:
# Example 2: Using saved objects
indicator_code = 'SP.POP.TOTL'  # Example: Total population
country_list = ['FR', 'DE', 'IT'] # France, Germany, Italy
start_year = '2020'
end_year = '2022'

data2 = download_worldbank(
    indicator=indicator_code,
    countries=country_list,
    date_start=start_year,
    date_end=end_year
)
data2.head(2)

Unnamed: 0,indicator,country,countryiso3code,date,value,unit,obs_status,decimal
0,"Population, total",Germany,DEU,2023,83280000,,,0
1,"Population, total",Germany,DEU,2022,83797985,,,0
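
To see which address `download_worldbank` requests, it can help to build the URL without downloading anything. The sketch below uses a hypothetical helper, `build_worldbank_url` (not part of the notebook). It assumes the API accepts a `date=start:end` query parameter, as written inside the function above, and it only adds the date filter when both years are given.

```python
def build_worldbank_url(indicator, countries, date_start=None, date_end=None):
    """Hypothetical helper: return the request URL as text, without any download."""
    url_base = 'http://api.worldbank.org/v2/'
    country_codes = ';'.join(countries)  # e.g. ['FR', 'DE'] becomes 'FR;DE'
    url = url_base + f'country/{country_codes}/indicator/{indicator}?per_page=30000'
    # Only add the date filter when both years were provided
    if date_start is not None and date_end is not None:
        url = url + f'&date={date_start}:{date_end}'
    return url

url_all_years = build_worldbank_url('SP.POP.TOTL', ['FR', 'DE', 'IT'])
url_with_dates = build_worldbank_url('SP.POP.TOTL', ['FR', 'DE', 'IT'], '2020', '2022')
print(url_all_years)
print(url_with_dates)
```

Printing the URL and opening it in a browser is a quick way to check what a request will return before writing more code around it.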


# 2. Where is this notebook? Where is my Data? 


### IMPORTANT: If you are running on Google Colab, you need to drag the CSV file into your workspace

## 2.1 `File path`: fancy way to say where a file or folder is located, often relative to the current folder you are working from

### 2.1.A Current folder

In [15]:
# Current path
import os
# os.getcwd(): Shows current folder this notebook is in
os.getcwd()

'/Users/corybaird/Desktop/graspp_2025_spring/notebooks/module_1/week_1'

In [16]:
# This shows the files in the folder three levels up from the current folder (each ".." means "one folder up")
os.listdir("../../..")

['.DS_Store',
 'requirements.txt',
 'uv.lock',
 'environment.yml',
 'pyproject.toml',
 'README.md',
 '.gitignore',
 '.venv',
 '.git',
 'data',
 'notebooks',
 'src']

In [17]:
# This shows the files in the data/examples/module_1 folder
os.listdir("../../../data/examples/module_1")

['world_bank_data.csv']
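
The folder names above are specific to one computer, so the following sketch recreates the idea in a temporary folder that works anywhere. The folder and file names are made up for illustration. It shows `os.listdir`, `os.path.exists`, and how `..` moves one folder up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small, made-up folder structure: <tmp>/data/examples/example.csv
    data_dir = os.path.join(tmp_dir, "data", "examples")
    os.makedirs(data_dir)
    file_path = os.path.join(data_dir, "example.csv")
    with open(file_path, "w") as f:
        f.write("country,value\nCanada,1\n")

    # List the files in the folder and check that the file exists
    files = os.listdir(data_dir)
    file_exists = os.path.exists(file_path)

    # ".." means one folder up: data/examples/.. is the data folder
    parent_listing = os.listdir(os.path.join(data_dir, ".."))

print(files, file_exists, parent_listing)
```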

### 2.1.1 Saving data: In this case words (a.k.a. a string)

In [18]:
# In Python, writing a name (without spaces), then =, then a value on the right-hand side saves that value under the name
file_location = "../../../data/examples/module_1/"
file_location

'../../../data/examples/module_1/'

## 2.2 Importing data (csv, excel, stata) to the notebook

In [19]:
os.listdir(file_location)

['world_bank_data.csv']

### 2.2.1 Import

In [20]:
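import pandas as pd
# Option 1: write out the full path to the file
pd.read_csv("../../../data/examples/module_1/world_bank_data.csv")
# Option 2: combine the saved folder path with the file name (strings can be joined with +)
pd.read_csv(file_location + "world_bank_data.csv")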

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
0,0,GDP per capita (current US$),Canada,CAN,2023,53431.185706
1,1,GDP per capita (current US$),Canada,CAN,2022,55509.393176
2,2,GDP per capita (current US$),Canada,CAN,2021,52496.844169
3,3,GDP per capita (current US$),Japan,JPN,2023,33766.526825
4,4,GDP per capita (current US$),Japan,JPN,2022,34017.271808
5,5,GDP per capita (current US$),Japan,JPN,2021,40058.537328
6,6,GDP per capita (current US$),Mexico,MEX,2023,13790.024343
7,7,GDP per capita (current US$),Mexico,MEX,2022,11385.407076
8,8,GDP per capita (current US$),Mexico,MEX,2021,10314.050674
9,9,GDP per capita (current US$),United States,USA,2023,82769.412211
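
Joining strings with `+` only works if the folder text ends with a slash. As a hedged alternative, the sketch below uses `os.path.join` from the standard library, which inserts the separator when needed. The path is the same one used above and nothing is read from disk.

```python
import os

file_location = "../../../data/examples/module_1/"
file_name = "world_bank_data.csv"

# Joining with +: works here because file_location ends with "/"
path_plus = file_location + file_name

# Joining with os.path.join: the separator is added automatically
path_join = os.path.join("../../../data/examples/module_1", file_name)

print(path_plus)
print(path_join)
```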


### 2.2.2 Import and save to notebook (same concept as 1.1.1)

In [21]:
# In Python, writing a name (without spaces), then =, then a value on the right-hand side saves that value under the name
df = pd.read_csv(file_location + "world_bank_data.csv")  # You can combine strings with +
df

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
0,0,GDP per capita (current US$),Canada,CAN,2023,53431.185706
1,1,GDP per capita (current US$),Canada,CAN,2022,55509.393176
2,2,GDP per capita (current US$),Canada,CAN,2021,52496.844169
3,3,GDP per capita (current US$),Japan,JPN,2023,33766.526825
4,4,GDP per capita (current US$),Japan,JPN,2022,34017.271808
5,5,GDP per capita (current US$),Japan,JPN,2021,40058.537328
6,6,GDP per capita (current US$),Mexico,MEX,2023,13790.024343
7,7,GDP per capita (current US$),Mexico,MEX,2022,11385.407076
8,8,GDP per capita (current US$),Mexico,MEX,2021,10314.050674
9,9,GDP per capita (current US$),United States,USA,2023,82769.412211


# A. The "boring" stuff: explaining what we just did

## A. Note for R-users

- Python is in many ways similar to R

- Comment code: #
- Some functions are exactly the same: print()
- To save an object, write a name and set it equal (=) to the object you want to save
- You must run the code in the correct order
- We use libraries!

In [22]:
# REMEMBER THIS IS WHAT WE USED BEFORE
# This line saves the web address (URL) of the World Bank data file as text in a variable named 'url'.
url = "https://github.com/Data-Science-Public-Policy/graspp_2025_spring/raw/refs/heads/module_1/data/examples/module_1/world_bank_data.csv"

In [23]:
# This is a comment
print('This is a print function')

This is a print function


In [24]:
saved_object = 'hello world'
print(saved_object)

hello world


In [25]:
saved_object = 'GRASPP IS COOL'
saved_object

'GRASPP IS COOL'

## A.1 Strings

- Think of it as a piece of data wrapped in quotes
- R-users: This is the same as in R!


In [26]:
my_str = 'Python is cool'
print(my_str)

Python is cool


### A.1.1 The `type()` function tells us what type our object is

In [27]:
type(my_str)

str

### A.1.2 Overwriting

In [28]:
my_str = 'Python is NOT cool'
my_str

'Python is NOT cool'

## A.2 Integers/Float

In [29]:
my_num = 100
type(my_num)

int

In [30]:
my_num = 100.9483859390942949
type(my_num)

float
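
Quotes decide the type: `'2020'` is a string, while `2020` is an integer. The short sketch below, with made-up values, shows how `int()` and `float()` convert text to numbers and why the difference matters when adding.

```python
start_year = '2020'                     # a string, because of the quotes
start_year_number = int(start_year)     # converted to an integer

next_year = start_year_number + 1       # numbers are added
joined_text = start_year + '1'          # strings are glued together

gdp_text = '53431.19'                   # made-up value stored as text
gdp_value = float(gdp_text)             # converted to a float

print(type(start_year), type(start_year_number), type(gdp_value))
print(next_year, joined_text)
```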

## A.2 Lists

- Lists can hold strings, numbers, or a mix of both
- Spacing and line breaks inside the brackets do not matter!

In [31]:
# REMEMBER WE USED LISTS BEFORE
# Example 2: Using saved objects
indicator_code = 'SP.POP.TOTL'  # Example: Total population
country_list = ['FR', 'DE', 'IT'] # France, Germany, Italy
start_year = '2020'
end_year = '2022'

data2 = download_worldbank(
    indicator=indicator_code,
    countries=country_list,
    date_start=start_year,
    date_end=end_year
)
data2.head(2)

Unnamed: 0,indicator,country,countryiso3code,date,value,unit,obs_status,decimal
0,"Population, total",Germany,DEU,2023,83280000,,,0
1,"Population, total",Germany,DEU,2022,83797985,,,0


In [32]:
my_list = ['Python', 'R', 'Stata', 'Excel']

my_list

['Python', 'R', 'Stata', 'Excel']

In [33]:
my_list = [3, 
           5,
           9,
           11]
type(my_list)

list
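
A few everyday list operations, sketched on the same example list: picking items by position, counting items, adding an item, and checking membership.

```python
my_list = ['Python', 'R', 'Stata', 'Excel']

first_item = my_list[0]      # positions start at zero
last_item = my_list[-1]      # negative positions count from the end
n_items = len(my_list)       # number of items before appending

my_list.append('Julia')      # add an item to the end of the list
has_r = 'R' in my_list       # True if 'R' is one of the items

print(first_item, last_item, n_items, my_list, has_r)
```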

## A.3 Dictionaries

In [34]:
rename_country_map = {
    "Canada": "Canada",
    "Mexico": "United Mexican States",
    "United States": "United States of America",
    "Germany": "Federal Republic of Germany",
}

In [35]:
rename_country_map.values()

dict_values(['Canada', 'United Mexican States', 'United States of America', 'Federal Republic of Germany'])

In [36]:
rename_country_map.keys()

dict_keys(['Canada', 'Mexico', 'United States', 'Germany'])
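
A dictionary pairs each key with a value, so the most common operation is looking up a value by its key. The sketch below uses a shortened version of the dictionary above. It also shows `.get()`, which returns a fallback instead of raising an error when the key is missing.

```python
rename_country_map = {
    "Mexico": "United Mexican States",
    "Germany": "Federal Republic of Germany",
}

# Look up a value by its key
mexico_name = rename_country_map["Mexico"]

# .get() returns the second argument when the key is not in the dictionary
japan_name = rename_country_map.get("Japan", "Japan")

# Add a new key-value pair
rename_country_map["Canada"] = "Canada"

# Loop over key-value pairs
for short_name, long_name in rename_country_map.items():
    print(short_name, "->", long_name)
```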

# B. Introduction to Pandas

### For R users: pandas is similar to dplyr!

In general we use it to:

- Manipulate data (Today's topic)
- Visualize our data (graphing etc.)
- Download data directly from the internet
- Build models (Regression, Machine learning, Neural Networks)

## B.A We must import the library before using it!
- Again, this is similar to R
- However, unlike in R, we (in general) access a library's functions through an alias, here `pd` (for example, `pd.read_csv`)

In [37]:
import pandas as pd

## B.1 Import data
- pd.read_stata
- pd.read_csv
- pd.read_excel

In [38]:
url = "https://github.com/Data-Science-Public-Policy/graspp_2025_spring/raw/refs/heads/module_1/data/examples/module_1/world_bank_data.csv"
df = pd.read_csv(url)

In [39]:
type(df)

pandas.core.frame.DataFrame

## B.2 Basics


### B.2.1 Info

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       12 non-null     int64  
 1   indicator        12 non-null     object 
 2   country          12 non-null     object 
 3   countryiso3code  12 non-null     object 
 4   date             12 non-null     int64  
 5   value            12 non-null     float64
dtypes: float64(1), int64(2), object(3)
memory usage: 708.0+ bytes


### B.2.2 Head, tail


In [41]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
0,0,GDP per capita (current US$),Canada,CAN,2023,53431.185706
1,1,GDP per capita (current US$),Canada,CAN,2022,55509.393176


In [42]:
df.tail(2)

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
10,10,GDP per capita (current US$),United States,USA,2022,78035.17536
11,11,GDP per capita (current US$),United States,USA,2021,71318.307359


### B.2.3 Descriptive stats

In [43]:
df.describe()

Unnamed: 0.1,Unnamed: 0,date,value
count,12.0,12.0,12.0
mean,5.5,2022.0,44741.011336
std,3.605551,0.852803,25272.662972
min,0.0,2021.0,10314.050674
25%,2.75,2021.0,28772.401205
50%,5.5,2022.0,46277.690748
75%,8.25,2023.0,59461.621722
max,11.0,2023.0,82769.412211


### B.2.4 Column names

In [44]:
df.columns

Index(['Unnamed: 0', 'indicator', 'country', 'countryiso3code', 'date',
       'value'],
      dtype='object')

In [45]:
columns = df.columns

#### Manipulating lists

In [46]:
# Slice from position 1 up to (but not including) position 3, i.e. the second and third elements
columns[1:3] # The end position of a slice is not included

Index(['indicator', 'country'], dtype='object')

In [47]:
# Indexing starts at zero
columns[0:2]

Index(['Unnamed: 0', 'indicator'], dtype='object')

In [48]:
columns[-2:]

Index(['date', 'value'], dtype='object')
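
The same slicing rules apply to ordinary Python lists. The sketch below repeats the slices above on a plain list of the column names and adds a step slice.

```python
columns_list = ['Unnamed: 0', 'indicator', 'country', 'countryiso3code', 'date', 'value']

second_and_third = columns_list[1:3]   # positions 1 and 2; position 3 is excluded
first_two = columns_list[:2]           # leaving out the start means "from the beginning"
last_two = columns_list[-2:]           # negative positions count from the end
every_other = columns_list[::2]        # a third number sets the step size

print(second_and_third, first_two, last_two, every_other)
```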

## B.3 Select and Filter

### B.3.1 Select column

In [49]:
df['country'].head(2)

0    Canada
1    Canada
Name: country, dtype: object

In [50]:
df.country.head(2)

0    Canada
1    Canada
Name: country, dtype: object

### B.3.2 Subset Rows

<img src="Screenshots/subset_row.svg" />


#### B.3.2.1 Show unique values in a column

In [51]:
df.country.unique()

array(['Canada', 'Japan', 'Mexico', 'United States'], dtype=object)

#### B.3.2.2 Select Row

In [52]:
df.query("country == 'Japan'")

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
3,3,GDP per capita (current US$),Japan,JPN,2023,33766.526825
4,4,GDP per capita (current US$),Japan,JPN,2022,34017.271808
5,5,GDP per capita (current US$),Japan,JPN,2021,40058.537328


In [53]:
mask = df.country == 'Japan'
mask[:2]

0    False
1    False
Name: country, dtype: bool

In [54]:
# Both lines select the rows where the mask is True; only the result of the last line is displayed
df[mask]
df.loc[mask]

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
3,3,GDP per capita (current US$),Japan,JPN,2023,33766.526825
4,4,GDP per capita (current US$),Japan,JPN,2022,34017.271808
5,5,GDP per capita (current US$),Japan,JPN,2021,40058.537328


#### B.3.2.3 Subset on multiple conditions

In [55]:
df.query("country == 'Japan' & date>2021")

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
3,3,GDP per capita (current US$),Japan,JPN,2023,33766.526825
4,4,GDP per capita (current US$),Japan,JPN,2022,34017.271808
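
The same filter can be written with boolean masks instead of `query`. The sketch below uses a small made-up DataFrame, `toy`, rather than the course data. Each condition needs its own parentheses, `&` means "and", and `|` means "or".

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "country": ["Japan", "Japan", "Mexico", "Mexico"],
    "date": [2023, 2021, 2023, 2021],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Both conditions must hold (and)
with_query = toy.query("country == 'Japan' & date > 2021")
with_mask = toy[(toy["country"] == "Japan") & (toy["date"] > 2021)]

# At least one condition must hold (or)
either = toy[(toy["country"] == "Japan") | (toy["date"] > 2021)]

print(with_mask)
print(either)
```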


#### B.3.2.4 Subset in a list

In [56]:
df.query("country in ['Japan', 'Mexico']")

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,value
3,3,GDP per capita (current US$),Japan,JPN,2023,33766.526825
4,4,GDP per capita (current US$),Japan,JPN,2022,34017.271808
5,5,GDP per capita (current US$),Japan,JPN,2021,40058.537328
6,6,GDP per capita (current US$),Mexico,MEX,2023,13790.024343
7,7,GDP per capita (current US$),Mexico,MEX,2022,11385.407076
8,8,GDP per capita (current US$),Mexico,MEX,2021,10314.050674


#### B.3.2.5 Subset rows and columns

In [57]:
df.query("country in ['Japan', 'Mexico']")[['country', 'date']]

Unnamed: 0,country,date
3,Japan,2023
4,Japan,2022
5,Japan,2021
6,Mexico,2023
7,Mexico,2022
8,Mexico,2021
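
An alternative sketch for selecting rows and columns in one step uses `.loc[rows, columns]` together with `isin`. The DataFrame here is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "country": ["Canada", "Japan", "Mexico"],
    "date": [2023, 2023, 2023],
    "value": [1.0, 2.0, 3.0],
})

# Rows: countries in the list; columns: only 'country' and 'date'
row_mask = toy["country"].isin(["Japan", "Mexico"])
selected = toy.loc[row_mask, ["country", "date"]]
print(selected)
```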


## B.3 Rename 

### B.3.1 Rename columns

In [58]:
df.rename({"value" : "GDP"}, axis=1).head(2)

Unnamed: 0.1,Unnamed: 0,indicator,country,countryiso3code,date,GDP
0,0,GDP per capita (current US$),Canada,CAN,2023,53431.185706
1,1,GDP per capita (current US$),Canada,CAN,2022,55509.393176
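
Note that `rename` returns a new DataFrame and leaves the original unchanged, so the result has to be saved under a name to keep it. The sketch below shows this on a made-up DataFrame, using the `columns=` keyword as an alternative to `axis=1`.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({"country": ["Canada"], "value": [1.0]})

# rename returns a new DataFrame; columns= is an alternative to axis=1
renamed = toy.rename(columns={"value": "GDP"})

print(toy.columns.tolist())       # the original is unchanged
print(renamed.columns.tolist())   # the new DataFrame has the new name
```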


### B.3.2 Rename values in a column

In [59]:
rename_country_map = {
    "Canada" : "O CANANDA"
}
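# replace: swaps matching values and leaves all other values unchanged (result not displayed here)
df.country.replace(rename_country_map)
# map: swaps matching values and turns every value missing from the dictionary into NaN (shown below)
df.country.map(rename_country_map)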

0     O CANANDA
1     O CANANDA
2     O CANANDA
3           NaN
4           NaN
5           NaN
6           NaN
7           NaN
8           NaN
9           NaN
10          NaN
11          NaN
Name: country, dtype: object
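
The difference between `replace` and `map` is easier to see side by side. The sketch below uses a made-up Series and mapping. It also shows one way to keep the original value when `map` finds no match, by filling the gaps with the original Series.

```python
import pandas as pd

# Made-up data for illustration
countries = pd.Series(["Canada", "Japan", "Mexico"])
mapping = {"Canada": "CAN"}

# replace: values not in the mapping stay as they are
replaced = countries.replace(mapping)

# map: values not in the mapping become NaN
mapped = countries.map(mapping)

# map, then fill the NaN gaps with the original values
mapped_filled = countries.map(mapping).fillna(countries)

print(replaced.tolist())
print(mapped.tolist())
print(mapped_filled.tolist())
```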

## B.4 Apply functions to pandas

### B.4.1 Basics

In [60]:
def basics(data):
    print('Head')
    print(data.head(2))
    print('')
    print('Tail')
    print(data.tail(2))
    print('')
    print('Columns')
    print(data.columns[:3])
    print('')
    print('Info')
    data.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()

basics(df)

Head
   Unnamed: 0                     indicator country countryiso3code  date  \
0           0  GDP per capita (current US$)  Canada             CAN  2023   
1           1  GDP per capita (current US$)  Canada             CAN  2022   

          value  
0  53431.185706  
1  55509.393176  

Tail
    Unnamed: 0                     indicator        country countryiso3code  \
10          10  GDP per capita (current US$)  United States             USA   
11          11  GDP per capita (current US$)  United States             USA   

    date         value  
10  2022  78035.175360  
11  2021  71318.307359  

Columns
Index(['Unnamed: 0', 'indicator', 'country'], dtype='object')

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       12 non-null     int64  
 1   indicator        12 non-null     object 
 2   country          12 no

### B.4.2 subset

In [61]:
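def subset(countries, columns):
    # Keep rows whose country is in 'countries', then keep only the listed columns.
    # Note: this uses the DataFrame 'df' defined earlier in the notebook.
    # '@countries' lets query() read the Python variable 'countries'.
    out = df.query("country in @countries")[columns]
    return out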
subset(
    countries = ['Japan', 'Mexico'],
    columns = ['country', 'date']
)

Unnamed: 0,country,date
3,Japan,2023
4,Japan,2022
5,Japan,2021
6,Mexico,2023
7,Mexico,2022
8,Mexico,2021
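
The `subset` function above always works on the notebook's `df`. As a sketch of a more reusable design, the hypothetical variant below, `subset_data` (not part of the notebook), takes the DataFrame as its first argument, so it can be applied to any table that has a `country` column. The data is made up for illustration.

```python
import pandas as pd

def subset_data(data, countries, columns):
    """Hypothetical variant of subset: the DataFrame is passed in explicitly."""
    # '@countries' refers to the function argument named 'countries'
    return data.query("country in @countries")[columns]

# Made-up data for illustration
toy = pd.DataFrame({
    "country": ["Canada", "Japan", "Mexico"],
    "date": [2023, 2023, 2023],
    "value": [1.0, 2.0, 3.0],
})

result = subset_data(toy, countries=["Japan", "Mexico"], columns=["country", "date"])
print(result)
```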
