## Discussion #4

Files needed = None

OH: Mondays and Wednesdays 9:15-10:15am in 6473 Sewell Social Sciences

Email: minnie.cui@wisc.edu

**Reminder:** Project team rosters are due tonight, and coding practice #2 is uploaded!

#### This week we saw:
- More pandas
- Reading files

## Pandas

Last week we covered a bunch of useful methods and functions in `pandas`, such as `head`, `tail`, `shape`, `iloc`, `loc`, `rename`, etc. We'll cover a few more today.

In [5]:
# Import libraries
import pandas as pd

# Generate data
state_data = {'state':['CA','MI','WI','MN'], 
              'pop':[37, 9.8, 5.7, '5.3'], 
              'size':[163.7, 96.7, 65.5, 86.9], 
              'bird':['Quail', 'Redbreast Robin', 'American Robin', 'Loon']
             }
state_data = pd.DataFrame(state_data)

# Try out some functions
# print(state_data.dtypes)
state_data['pop'] = state_data['pop'].astype(float)
#print("")
print(state_data.dtypes)
# print(state_data.head(2))
# print("")
# print(state_data.tail(2))
# print("")
# print(state_data.sample(2))
state_data.describe()


state     object
pop      float64
size     float64
bird      object
dtype: object


Unnamed: 0,pop,size
count,4.0,4.0
mean,14.45,103.2
std,15.170256,42.38506
min,5.3,65.5
25%,5.6,81.55
50%,7.75,91.8
75%,16.6,113.45
max,37.0,163.7


In [6]:
# Renaming, can also manually write it 
old_names = state_data.columns.to_list()
new_names = state_data.columns.to_list()
new_names[1] = 'population'
names = dict(zip(old_names, new_names))    # zip pairs up the elements of the two lists
print(names)

state_data = state_data.rename(columns=names)
state_data.head(2)

{'state': 'state', 'pop': 'population', 'size': 'size', 'bird': 'bird'}


Unnamed: 0,state,population,size,bird
0,CA,37.0,163.7,Quail
1,MI,9.8,96.7,Redbreast Robin


In [7]:
print(state_data)
print("")

# Sums 
print('Sum across columns')
print(state_data[['population', 'size']].sum(axis=1))  # axis=1: add across the columns, one total per row; axis=0: add down the rows, one total per column

print('\nSum across rows')
# ['population', 'size']
print(state_data[['population', 'size']].sum(axis=0)) #default

print('\nSum up population')
print(state_data['population'].sum())

  state  population   size             bird
0    CA        37.0  163.7            Quail
1    MI         9.8   96.7  Redbreast Robin
2    WI         5.7   65.5   American Robin
3    MN         5.3   86.9             Loon

Sum across columns
0    200.7
1    106.5
2     71.2
3     92.2
dtype: float64

Sum across rows
population     57.8
size          412.8
dtype: float64

Sum up population
57.8


In [8]:
# Means
print('\nMean of each column')
#print(state_data.mean(axis=0)) 

print('\nMean population and size')
print(state_data[['population', 'size']].mean(axis=0))  #must specify names


Mean of each column

Mean population and size
population     14.45
size          103.20
dtype: float64
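In recent pandas versions, calling `mean` on a DataFrame that still contains string columns may raise an error, which is one reason to select the numeric columns first. Here is a minimal sketch of an alternative, assuming the `numeric_only` argument is available in your pandas version. The `demo` DataFrame is a small stand-in for `state_data`, not part of the original notebook.

```python
import pandas as pd

# Small stand-in for state_data (same values as above)
demo = pd.DataFrame({'state': ['CA', 'MI', 'WI', 'MN'],
                     'population': [37, 9.8, 5.7, 5.3],
                     'size': [163.7, 96.7, 65.5, 86.9]})

# numeric_only=True skips the string column instead of failing on it
col_means = demo.mean(numeric_only=True)
print(col_means)
```

This should give the same means as selecting `['population', 'size']` by hand, without having to list the column names.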


We can use TAB completion to see all the available methods, or consult the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

For example: `sum`, `mean`, `var`, `std`, `skew`, `rank`, `quantile`, `mode`, `min`, `max`, `kurtosis`, etc.

In [9]:
# Type `state_data.` and press TAB to list the available methods

## Reading/Writing Data

We can use `pd.read_csv('filename.csv')` or `pd.read_excel('filename.xlsx')` to read in data from CSVs and Excel spreadsheets. It's super important to remember to be in the right directory! We can use the `os` package to get our current working directory or change directories.
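A quick sketch of checking and changing the working directory with `os`. The temporary folder here is only a stand-in for wherever your data files live; on your machine you would pass your own folder path to `os.chdir`.

```python
import os
import tempfile

# Where is Python currently looking for files?
start_dir = os.getcwd()
print("Current working directory:", start_dir)

# Move into another folder (a temporary one, purely for illustration)
with tempfile.TemporaryDirectory() as data_dir:
    os.chdir(data_dir)
    print("Now in:", os.getcwd())
    # Move back before the temporary folder is deleted
    os.chdir(start_dir)

print("Back in:", os.getcwd())
```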

Some useful options are `sheet_name` (for Excel files, to choose which sheet to read), `header` (the row number that holds the column names), `index_col` (the column to use as the index), `na_values` (extra strings to treat as missing values), and `usecols` (to read only a subset of the columns).
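To see a few of these options in action without needing a file on disk, here is a small sketch that reads made-up CSV text from memory. The data and the `'--'` missing-value marker are invented for illustration.

```python
import io
import pandas as pd

# Made-up CSV text: a title line, then the column names, then the data
raw = io.StringIO(
    "Made-up state data for illustration\n"
    "state,pop,size,bird\n"
    "CA,37,163.7,Quail\n"
    "MI,9.8,96.7,--\n"
    "WI,5.7,65.5,American Robin\n"
)

demo = pd.read_csv(raw,
                   header=1,                         # column names are on the second line
                   index_col='state',                # use the state column as the index
                   na_values=['--'],                 # treat '--' as missing
                   usecols=['state', 'pop', 'bird']  # read only these columns
                   )
print(demo)
```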

## Example: Pisa Data

Many thanks to Greg for this example! The [PISA](https://www.oecd.org/pisa/) test is a test given to 15-year-olds around the world. It evaluates reading, math, and science skills. 

In a web browser, go to [dx.doi.org/10.1787/888932937035](http://dx.doi.org/10.1787/888932937035) and download the spreadsheet with PISA scores by country. Check out the (very messy) data.

This was formatted as a spreadsheet rather than something easy for a computer to read ("rectangular" data). We'll need to wrangle it into shape.

Then, use `read_excel()` to create a DataFrame named `pisa` with the countries and their mean scores in math, reading, and science. You can also read the file directly from the Internet!

   1. Notice the last few rows of the data. Do we want to include these? Check out the `skipfooter` option. Use the documentation for `read_excel()` to learn more.  
   2. There's also something strange at the top of the spreadsheet. Check out the `skiprows` option.
   3. You can select the columns you want from the spreadsheet with the `usecols` option. Or, you could just delete columns you don't want once you've loaded the data.
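Before trying these options on the real spreadsheet, it can help to see them on a toy example. The sketch below uses `read_csv` on made-up text rather than `read_excel`, since `skiprows`, `skipfooter`, and `usecols` work in a similar way there. Note that with `read_csv`, `skipfooter` needs `engine='python'`. All values are invented.

```python
import io
import pandas as pd

# Made-up "messy" file: two junk lines on top, two note lines at the bottom
messy = io.StringIO(
    "Table 1: made-up scores\n"
    "Illustrative only\n"
    "country,math,reading\n"
    "A,500,510\n"
    "B,480,495\n"
    "Note: values are invented\n"
    "Source: none\n"
)

scores = pd.read_csv(messy,
                     skiprows=2,       # skip the two junk lines at the top
                     skipfooter=2,     # skip the two note lines at the bottom
                     engine='python'   # skipfooter is not supported by the default engine
                     )
print(scores)
```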

In [None]:
# Run this first to install requirements needed on Winstat
!pip install xlrd
# After you run this cell, click Kernel --> Restart Kernel and clear all outputs 
# Then run the code cell below again

In [3]:
# Reading from a link
import pandas as pd
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
                     skiprows=18,             # skip the first 18 rows
                     skipfooter=7,            # skip the last 7 rows
                     usecols=[0,1,9,13],      # select columns of interest
                     )

# Renaming
old_names = pisa.columns.to_list()
new_names = pisa.columns.to_list()
new_names[0] = "Country"
names = dict(zip(old_names, new_names))
print(names)
pisa = pisa.rename(columns=names)

# Display
pisa.head()

{'Unnamed: 0': 'Country', 'Mathematics   ': 'Mathematics   ', 'Reading  ': 'Reading  ', 'Science  ': 'Science  '}


Unnamed: 0,Country,Mathematics,Reading,Science
0,,Mean score in PISA 2012,Mean score in PISA 2012,Mean score in PISA 2012
1,,,,
2,OECD average,494.046447,496.462864,501.159793
3,,,,
4,Shanghai-China,612.675536,569.588408,580.117831


There are lots of `NaN`s (missing values). How do we handle those?

Look up `dropna()` in the pandas documentation. Clean up your DataFrame. Drop any rows that have *at least one* `NaN`. Save the result into a new DataFrame named `pisa2`.

How many rows are in `pisa2`? 
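As a warm-up, here is a small sketch of how `dropna()` behaves on a made-up DataFrame (`toy` and its values are invented). By default it drops a row if *any* value is missing. The `how` and `subset` arguments change that rule.

```python
import pandas as pd

# Row 0 is complete, row 1 is entirely missing, row 2 is missing only Math
toy = pd.DataFrame({'Country': ['A', None, 'C'],
                    'Math': [500.0, None, None]})

any_dropped = toy.dropna()                       # drop rows with at least one NaN
all_dropped = toy.dropna(how='all')              # drop rows only if every value is NaN
by_country = toy.dropna(subset=['Country'])      # only look at the Country column

print(len(any_dropped), len(all_dropped), len(by_country))
```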

In [4]:
# Drop missing values
pisa2 = pisa.dropna()
print(pisa.shape)        #shape of original dataset
print(pisa2.shape)
pisa2.head(5)

(69, 4)
(66, 4)


Unnamed: 0,Country,Mathematics,Reading,Science
2,OECD average,494.046447,496.462864,501.159793
4,Shanghai-China,612.675536,569.588408,580.117831
5,Singapore,573.468314,542.215834,551.493157
6,Hong Kong-China,561.241096,544.600086,554.937434
7,Chinese Taipei,559.824796,523.118904,523.314904


Using `pisa2`, make the country names the index.

In [5]:
# Set the index to the country names, replacing the default integer index
pisa2 = pisa2.set_index('Country')
print(pisa2.head(2))

               Mathematics      Reading     Science  
Country                                              
OECD average       494.046447  496.462864  501.159793
Shanghai-China     612.675536  569.588408  580.117831


Using `pisa2`, print out the ratios of the United States PISA scores (math, reading, science) relative to the OECD average.

In [6]:
# Print the US relative to the average -- ratio
pisa2.loc['United States'] / pisa2.loc['OECD average']    

Mathematics       0.974335
Reading           1.002254
Science           0.992517
dtype: object

In [9]:
# Are all the columns floats? If they weren't, how would we convert them?
print(pisa.dtypes)

pisa2.columns = ["Math", "Read", "Sci"]
print(pisa2.dtypes)
pisa2 = pisa2[["Math", "Read", "Sci"]].apply(pd.to_numeric)    
print(pisa2.dtypes)
# Insert your comments here. 

Country           object
Mathematics       object
Reading           object
Science           object
dtype: object
Math    object
Read    object
Sci     object
dtype: object
Math    float64
Read    float64
Sci     float64
dtype: object
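The columns start out as `object` because the spreadsheet mixed text (like the "Mean score in PISA 2012" row) with numbers. If some text were still left in a column, `pd.to_numeric` would fail by default. One possible way around that, sketched on a made-up Series, is `errors='coerce'`, which turns anything that cannot be converted into `NaN`.

```python
import pandas as pd

# Made-up column that mixes number-like strings with leftover text
raw = pd.Series(['494.05', '612.68', 'Mean score in PISA 2012'])

# errors='coerce' converts what it can and marks the rest as NaN
converted = pd.to_numeric(raw, errors='coerce')
print(converted)
print(converted.dtype)
```

After coercing, it is worth checking how many `NaN`s appeared, so that bad values are not silently lost.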


## Some skills you'll need as you begin your project proposals...

When putting your proposals together with your project teams, keep in mind a few things to make your work look clear and professional.

### Word and LaTeX templates on Canvas

You can submit your proposal in Word or LaTeX. We've put together some templates for you to follow on the [Project Information](https://canvas.wisc.edu/courses/398123/pages/project-information) page on Canvas. 

LaTeX is one of the most popular typesetting programs. It's great for formatting papers and reports and I'd recommend looking into it. Of course, you can make a professional document in Word as well. Feel free to reach out if you have any questions about LaTeX.

### Proper bibliographic citation 

You need to cite **all** referenced work in your proposal (and final reports)! Below is an example of proper citation: 

One of the most commonly known habits of opossums is “playing dead” or, as it is frequently called, “playing possum” (Cui 2024). This is real, although the opossum is not playing, which suggests there is some intent at work.

An opossum, when confronted with a threat, will often hiss or bare its teeth. Or more likely, run. But if it is surprised by a predator, it will enter a catatonic state. It basically faints and is in a state of unconsciousness. The opossum has no control over this; it’s involuntary.

#### Bibliography
Cui, M. (2024). Opossums play dead when confronted. *Journal of Opossums Habits 3*(32), 123-4. 

In [None]:
#APA, Chicago

## Have a great weekend!