# POLI 175 - Machine Learning for Social Sciences

## Python Refresh II

---

## Load Pandas and Numpy

To get started, let us load Pandas and Numpy:

In [1]:
# My code here
import pandas as pd
import numpy as np

## Load Datasets

We will load two datasets here:

In [2]:
# My code here
PErisk = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/PErisk.csv')
tips = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/tips.csv')

Let's explore the datasets we just loaded.

In [3]:
# My code here
PErisk.head()

,country,courts,barb2,prsexp2,prscorr2,gdpw2
0,Argentina,0,-0.720775,1,3,9.69017
1,Australia,1,-6.907755,5,4,10.30484
2,Austria,1,-4.910337,5,4,10.10094
3,Bangladesh,0,0.775975,1,0,8.379768
4,Belgium,1,-4.617344,5,4,10.25012


In [4]:
PErisk.describe()

,courts,barb2,prsexp2,prscorr2,gdpw2
count,62.0,62.0,62.0,62.0,62.0
mean,0.451613,-2.925557,3.274194,2.532258,9.041875
std,0.501716,2.707211,1.369089,1.501013,0.970264
min,0.0,-6.907755,0.0,0.0,7.029973
25%,0.0,-4.894882,3.0,1.25,8.381027
50%,0.0,-2.353233,3.0,2.0,9.185412
75%,1.0,-1.301007,4.0,4.0,9.88928
max,1.0,2.337425,5.0,5.0,10.41018


In [5]:
PErisk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   country   62 non-null     object 
 1   courts    62 non-null     int64  
 2   barb2     62 non-null     float64
 3   prsexp2   62 non-null     int64  
 4   prscorr2  62 non-null     int64  
 5   gdpw2     62 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 3.0+ KB


In [6]:
PErisk.sample(10)

,country,courts,barb2,prsexp2,prscorr2,gdpw2
20,"Gambia, The",0,-1.543332,4,2,7.501082
39,Norway,1,-6.907755,5,5,10.29833
49,Sri Lanka,0,-1.864343,2,2,8.627661
7,Brazil,1,-0.457034,4,3,9.375601
12,Colombia,0,-2.057821,3,2,9.191973
52,Syria,0,1.725166,1,1,9.664151
15,Cote d'Ivoire,1,-4.229065,4,2,8.228711
53,Thailand,0,-6.907755,3,2,8.548692
44,Portugal,1,-2.459625,4,3,9.444543
36,Morocco,0,-3.156958,3,1,8.78048


## Counting

### Counting data

To count how often each value of a variable occurs, we can use:

```
dat["variable"].value_counts()
```

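By default, the counts are sorted from most to least frequent. We can make this explicit with `sort = True` (or turn the sorting off with `sort = False`):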

```
dat["variable"].value_counts(sort = True)
```

We can also count proportions:

```
dat["variable"].value_counts(normalize = True)
```
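
As a minimal sketch of the three calls above, the example below uses a small made-up data frame (the name `toy` and its values are hypothetical, not one of the course datasets):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"day": ["Sat", "Sun", "Sat", "Thu", "Sat", "Sun"]})

# Raw counts, sorted from most to least frequent (the default)
counts = toy["day"].value_counts()

# Same counts without the frequency sorting
unsorted_counts = toy["day"].value_counts(sort=False)

# Proportions instead of counts; they add up to 1
props = toy["day"].value_counts(normalize=True)

print(counts)
print(props)
```

Here `"Sat"` appears in three of the six rows, so its count is 3 and its proportion is 0.5.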

Let's try it!

### Detecting missing data

We can also detect missing data using the function:

```
dat.isna()
```

And, if we want, we can count the missing values by variable:

```
dat.isna().sum()
```

Finally, to drop the rows that contain missing values, we can use:

```
dat.dropna()
```

Or we can fill the missing values with a custom value (proceed with caution here!):

```
dat.fillna(0)
```
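
The sketch below runs these methods on a small made-up data frame with a few missing entries. The name `toy` and its values are hypothetical, and filling with the column mean is just one possible choice, not a recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

# Number of missing values in each column
missing_by_col = toy.isna().sum()

# Keep only the rows with no missing values at all
complete = toy.dropna()

# Fill only column "x" with its mean; column "y" is left untouched
filled = toy.fillna({"x": toy["x"].mean()})

print(missing_by_col)
print(complete)
print(filled)
```

Note that `dropna()` and `fillna()` return new data frames; `toy` itself is unchanged unless we reassign it.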

In [20]:
PErisk.isna().sum()

country     0
courts      0
barb2       0
prsexp2     0
prscorr2    0
gdpw2       0
dtype: int64

In [7]:
# My code here
PErisk.head()

,country,courts,barb2,prsexp2,prscorr2,gdpw2
0,Argentina,0,-0.720775,1,3,9.69017
1,Australia,1,-6.907755,5,4,10.30484
2,Austria,1,-4.910337,5,4,10.10094
3,Bangladesh,0,0.775975,1,0,8.379768
4,Belgium,1,-4.617344,5,4,10.25012


In [13]:
PErisk['prsexp2'].value_counts(sort = False, normalize = True)

1    0.096774
5    0.225806
0    0.032258
4    0.225806
3    0.306452
2    0.112903
Name: prsexp2, dtype: float64

**Exercise**: Count the number of observations in `tips` by weekday. Then, normalize the counts to get proportions.

In [18]:
## Your answers here!
tips[['day', 'time']].value_counts()

day  time 
Sat  Night    87
Sun  Night    76
Thu  Day      61
Fri  Night    12
     Day       7
Thu  Night     1
dtype: int64

## Summary by groups

Suppose we want the mean GDP (`gdpw2`) for countries with and without courts (`courts`). There are two ways:

```
# Hard way
PErisk[PErisk['courts'] == 0]['gdpw2'].mean()
PErisk[PErisk['courts'] == 1]['gdpw2'].mean()
```

Or, we can use the `groupby` method in Pandas:

```
# Easy way
PErisk.groupby("courts")["gdpw2"].mean()
```
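
To see that both routes give the same numbers, here is a minimal sketch on made-up data (the data frame `toy` and its columns `group` and `value` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: a binary grouping variable and a numeric variable
toy = pd.DataFrame({
    "group": [0, 0, 1, 1, 1],
    "value": [8.0, 9.0, 10.0, 10.5, 9.5],
})

# Hard way: filter each group by hand, then take the mean
hard_0 = toy[toy["group"] == 0]["value"].mean()
hard_1 = toy[toy["group"] == 1]["value"].mean()

# Easy way: one call returns a Series indexed by the group values
easy = toy.groupby("group")["value"].mean()

print(hard_0, hard_1)
print(easy)
```

The `groupby` version scales better: it needs no extra line when the grouping variable has more than two values.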

In [23]:
# My code here
PErisk.groupby('prsexp2')['gdpw2'].mean()

prsexp2
0     8.975829
1     8.483506
2     8.695454
3     8.613695
4     8.947314
5    10.139483
Name: gdpw2, dtype: float64

**Exercise**: In the `tips` dataset, compute the mean of tips by weekday.

In [26]:
## Your answers here!
tips.groupby(['day', 'time'])['tip'].sum()

day  time 
Fri  Day       16.68
     Night     35.28
Sat  Night    260.40
Sun  Night    247.39
Thu  Day      168.83
     Night      3.00
Name: tip, dtype: float64

### Summary by groups (multiple functions)

To compute several summary statistics per group at once, we can pass a list of functions to `.agg()`:

```
dat.groupby("var_group")["var_stat"].agg([stat1, stat2, stat3])
```
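
A minimal sketch of this pattern, assuming a made-up data frame `toy` with hypothetical columns `group` and `value`. The statistics are passed as strings, which pandas accepts for its built-in aggregations:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 5.0, 4.0],
})

# One row per group, one column per statistic
summary = toy.groupby("group")["value"].agg(["min", "max", "sum"])

print(summary)
```

The result is a data frame, so a single cell can be read with `summary.loc["a", "sum"]`.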

### Summary by groups (multiple levels)

To group by more than one variable, we can pass a list of column names to `groupby`:

```
dat.groupby(["varlevel1", "varlevel2"])["var_stat"].mean()
```
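
As a sketch of what the result looks like, the example below groups a made-up data frame by two variables (the name `toy` and its values are hypothetical; only the column names mimic `tips`):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Fri", "Sat"],
    "time": ["Day", "Night", "Night", "Night"],
    "tip": [1.0, 2.0, 3.0, 4.0],
})

# The result is a Series with a two-level index: (day, time)
totals = toy.groupby(["day", "time"])["tip"].sum()

# Pick one combination with a tuple of index values
fri_night = totals.loc[("Fri", "Night")]

# Turn the index levels back into ordinary columns
flat = totals.reset_index()

print(totals)
print(fri_night)
print(flat)
```

Only combinations that occur in the data appear in the result, so there is no row for Saturday during the day here.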

In [27]:
# My code here
PErisk.head()

,country,courts,barb2,prsexp2,prscorr2,gdpw2
0,Argentina,0,-0.720775,1,3,9.69017
1,Australia,1,-6.907755,5,4,10.30484
2,Austria,1,-4.910337,5,4,10.10094
3,Bangladesh,0,0.775975,1,0,8.379768
4,Belgium,1,-4.617344,5,4,10.25012


In [28]:
PErisk.groupby(['prscorr2'])['gdpw2'].agg(['min', 'max', 'sum'])

prscorr2,min,max,sum
0,7.095064,8.727616,41.178301
1,7.096721,9.664151,89.571847
2,7.501082,9.84882,157.378553
3,7.029973,10.26078,101.910794
4,9.167329,10.30484,78.866357
5,9.882724,10.41018,91.690384


**Exercise**: For the `tips` dataset:

1. Compute maximum, minimum, and sum of tips by weekday
2. Compute the sum of tips by weekday and time of day.

In [9]:
## Your answers here!

## Indexing

To inspect the column labels and the row index, we use:

```
dat.columns
dat.index
```

We can set a column as the index:

```
dat_ind = dat.set_index("var_index")
```

And to remove the index, turning it back into a regular column:

```
dat_ind.reset_index()
```

We index because it makes subsetting simple:

```
# Hard way:
PErisk[PErisk["country"].isin(["Argentina", "Austria"])]

# Easy way (after PErisk_ind = PErisk.set_index("country")):
PErisk_ind.loc[["Argentina", "Austria"]]
```

Also, indexes do not need to be unique, and you can use multiple levels to index.
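
The sketch below illustrates these points on a made-up data frame (the name `toy`, its columns, and its values are hypothetical): a non-unique index, a two-level index, and `reset_index()` to undo the indexing.

```python
import pandas as pd

# Hypothetical toy data: country "A" appears in two years
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A"],
    "year": [2000, 2000, 2000, 2001],
    "value": [1, 2, 3, 4],
})

# Non-unique index: selecting "A" returns both of its rows
by_country = toy.set_index("country")
rows_a = by_country.loc[["A"]]

# Two-level index; sorting it keeps lookups fast and warning-free
multi = toy.set_index(["country", "year"]).sort_index()
a_2001 = multi.loc[("A", 2001), "value"]

# Undo the indexing: "country" becomes a regular column again
restored = by_country.reset_index()

print(rows_a)
print(a_2001)
print(restored)
```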

In [30]:
# My code here
PErisk.head()
PErisk_ind = PErisk.set_index('country')

In [31]:
PErisk_ind.head()

country,courts,barb2,prsexp2,prscorr2,gdpw2
Argentina,0,-0.720775,1,3,9.69017
Australia,1,-6.907755,5,4,10.30484
Austria,1,-4.910337,5,4,10.10094
Bangladesh,0,0.775975,1,0,8.379768
Belgium,1,-4.617344,5,4,10.25012


In [32]:
PErisk_ind.loc[['Argentina', 'Australia']]

country,courts,barb2,prsexp2,prscorr2,gdpw2
Argentina,0,-0.720775,1,3,9.69017
Australia,1,-6.907755,5,4,10.30484


**Exercise**: Index the `tips` data by the variable `obs`. Subset the observations 32 and 131.

In [33]:
## Your answers here!
tips_ind = tips.set_index('obs')
tips_ind.loc[[32, 131]]

obs,totbill,tip,sex,smoker,day,time,size
32,18.35,2.5,M,No,Sat,Night,4
131,19.08,1.5,M,No,Thu,Day,2


## Plots

Now, let's create some plots!

The main plotting library in Python is `matplotlib`. We can import its `pyplot` module like this:

```
from matplotlib import pyplot as plt
```

### Scatterplot

To make a scatterplot, we can write:

```
plt.scatter(dat.vx, dat.vy)
plt.show()
```

If we want to add axis labels and a title:

```
plt.scatter(dat.vx, dat.vy)
plt.xlabel("X-axis name")
plt.ylabel("Y-axis name")
plt.title("Plot title")
plt.show()
```
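
Here is a minimal, self-contained sketch of a labeled scatterplot on made-up data (the data frame `toy` and its values are hypothetical). It uses the `fig, ax` interface, which is equivalent to the `plt.` calls above but makes the axes object explicit; the `Agg` backend line is only an assumption for running the code as a script without a display and should be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"vx": [1, 2, 3, 4], "vy": [2.1, 3.9, 6.2, 7.8]})

fig, ax = plt.subplots()

# One point per row of the data frame
ax.scatter(toy["vx"], toy["vy"])

# Axis labels and title
ax.set_xlabel("X-axis name")
ax.set_ylabel("Y-axis name")
ax.set_title("Plot title")

# In a notebook or interactive session, plt.show() displays the figure
```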

In [12]:
# My code here

### Histogram

We can make a simple histogram using the function `.hist()`:

```
dat['variable'].hist()
plt.show()
```

And if we want overlapping histograms by a category:

```
dat[dat['vcat'] == 'v1']['variable'].hist()
dat[dat['vcat'] == 'v2']['variable'].hist()
plt.legend(["v1", "v2"])
plt.show()
```
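
Overlapping histograms are easier to read when the bars are semi-transparent. The sketch below assumes a made-up data frame `toy` (hypothetical names and values) and passes `alpha`, `bins`, and `label` through to matplotlib; as before, the `Agg` backend line is only for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical toy data: one numeric variable and one two-level category
toy = pd.DataFrame({
    "vcat": ["v1"] * 5 + ["v2"] * 5,
    "variable": [1.0, 2.0, 2.5, 3.0, 4.0, 3.0, 4.0, 4.5, 5.0, 6.0],
})

fig, ax = plt.subplots()

# Draw both histograms on the same axes, semi-transparent so overlap is visible
toy[toy["vcat"] == "v1"]["variable"].hist(ax=ax, bins=5, alpha=0.5, label="v1")
toy[toy["vcat"] == "v2"]["variable"].hist(ax=ax, bins=5, alpha=0.5, label="v2")

# The legend picks up the labels given above
ax.legend()

# In a notebook or interactive session, plt.show() displays the figure
```

Each group gets its own bin edges here; to compare shapes on identical bins, the same explicit list of edges could be passed as `bins` in both calls.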

Let's try it!

In [13]:
# My code here

**Exercise**:

In [14]:
## Your answers here!

**Great job!!!**