# POLI 175 - Machine Learning for Social Sciences

## Python Refresh I

---

# Data Science with Pandas

## Load Pandas

Loading pandas is easy. Provided that the package is installed (if not, see the [installation instructions](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)), type:

In [1]:
# My code here
import pandas as pd
import numpy as np

## Load Data into Python

To start having fun, we need to load data into Python. We can do this in three ways: from a local file, from the internet, or by typing the data on the keyboard.

### From a local file

First, we need to find the working directory. To do that, we use the `os` library:

```
import os
print(os.getcwd())
```

Then, put the file in that folder. If you would rather change the working directory, use:

```
os.chdir("new_path_here")
```

Now that we know the folder, and the file is there, we can load it:

```
dat = pd.read_csv('file_name_here.csv')
```
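As an alternative to changing the working directory, you can pass the full path of the file to `read_csv`. The sketch below is only an illustration: the toy data, the temporary folder, and the file name `toy_file.csv` are all made up so that the example runs on any machine.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({"country": ["A", "B"], "score": [1.5, 2.5]})

with tempfile.TemporaryDirectory() as folder:
    # Build the full path instead of relying on the working directory
    path = os.path.join(folder, "toy_file.csv")  # hypothetical file name
    toy.to_csv(path, index=False)                # save without the row index
    loaded = pd.read_csv(path)                   # read the file back from disk

print(loaded)
```

Building the path with `os.path.join` keeps the code working on Windows, macOS, and Linux, which use different path separators.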

### From the internet

We can also load a dataset directly from the internet.

For example, consider the following dataset: https://raw.githubusercontent.com/umbertomig/qtm150/master/datasets/PErisk.csv.

To open it, we pass the URL to `read_csv`, just as we did with the local file.

In [20]:
# My code here
dat = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/tips.csv')


### From typing on the keyboard

We can also build a dataset from scratch.

For example, we could build a simple dataset in the following way:

```
dat = pd.DataFrame({
    "v1": ['d1', 'd2', 'd3'],
    "v2": [1, 2, 3],
    "v3": ['A', 'B', 'A'],
    "v4": [2.0, 1.1, 2.2]})
```

This works for small datasets, but typing the data by hand quickly becomes inconvenient.
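The dictionary above is organized by column. As a rough sketch, the same dataset can also be typed row by row, as a list of dictionaries. The names `dat_small`, `rows`, and `dat_rows` are made up for this illustration.

```python
import pandas as pd

# Column by column, as in the example above
dat_small = pd.DataFrame({
    "v1": ["d1", "d2", "d3"],
    "v2": [1, 2, 3],
    "v3": ["A", "B", "A"],
    "v4": [2.0, 1.1, 2.2]})

# Row by row: one dictionary per observation
rows = [
    {"v1": "d1", "v2": 1, "v3": "A", "v4": 2.0},
    {"v1": "d2", "v2": 2, "v3": "B", "v4": 1.1},
    {"v1": "d3", "v2": 3, "v3": "A", "v4": 2.2},
]
dat_rows = pd.DataFrame(rows)

# Both constructions should give the same dataset
print(dat_rows.equals(dat_small))
```

Which form is more convenient depends on how the data arrive: the column form is shorter, while the row form mirrors how observations are usually recorded.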

## Dataset Information

Suppose we have a pandas dataset called `dat`. To make it more realistic, use the following example:

```
# For me: PErisk
dat = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm150/master/datasets/PErisk.csv')

# For you: tips
dat2 = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/tips.csv')
```

If you are having VPN issues, let me know.

In [21]:
# For me: PErisk
dat = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm150/master/datasets/PErisk.csv')

# For you: tips
dat2 = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/tips.csv')

### .info(.)

This method prints the information about the content of a dataset.

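Syntax and Usage: `dat.info()`

Note: `.info()` prints its report directly and returns `None`, which is why wrapping it in `print()` shows an extra `None` at the end of the output.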

In [4]:
# My code here
print(dat.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   country   62 non-null     object 
 1   courts    62 non-null     int64  
 2   barb2     62 non-null     float64
 3   prsexp2   62 non-null     int64  
 4   prscorr2  62 non-null     int64  
 5   gdpw2     62 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 3.0+ KB
None
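The trailing `None` above is the return value of `.info()`. A small sketch of this behavior, using a made-up one-column dataset and the `buf` argument of `.info()` to capture the report as text instead of printing it:

```python
import io

import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"x": [1.0, 2.0, None]})

buffer = io.StringIO()
result = toy.info(buf=buffer)  # the report is written to the buffer
report = buffer.getvalue()     # the report as a regular string

print(result is None)  # .info() itself returns nothing
print(report)
```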


### .head(.)

This method prints the first few observations of the dataset.

Syntax and Usage: `print(dat.head())`

In [5]:
# My code here
dat.head()

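|   | country    | courts | barb2     | prsexp2 | prscorr2 | gdpw2     |
|---|------------|--------|-----------|---------|----------|-----------|
| 0 | Argentina  | 0      | -0.720775 | 1       | 3        | 9.69017   |
| 1 | Australia  | 1      | -6.907755 | 5       | 4        | 10.30484  |
| 2 | Austria    | 1      | -4.910337 | 5       | 4        | 10.10094  |
| 3 | Bangladesh | 0      | 0.775975  | 1       | 0        | 8.379768  |
| 4 | Belgium    | 1      | -4.617344 | 5       | 4        | 10.25012  |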


### .shape

This returns the number of rows and columns of a dataset, as a tuple.

Syntax and Usage: `print(dat.shape)`

Note: `.shape` is an attribute, not a method, so no parentheses are needed.

In [6]:
# My code here
dat.shape

(62, 6)

### .describe(.)

This method gives us a few summary statistics of the dataset.

Syntax and Usage: `print(dat.describe())`

In [7]:
# My code here
dat.describe()

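|       | courts   | barb2     | prsexp2  | prscorr2 | gdpw2    |
|-------|----------|-----------|----------|----------|----------|
| count | 62.0     | 62.0      | 62.0     | 62.0     | 62.0     |
| mean  | 0.451613 | -2.925557 | 3.274194 | 2.532258 | 9.041875 |
| std   | 0.501716 | 2.707211  | 1.369089 | 1.501013 | 0.970264 |
| min   | 0.0      | -6.907755 | 0.0      | 0.0      | 7.029973 |
| 25%   | 0.0      | -4.894882 | 3.0      | 1.25     | 8.381027 |
| 50%   | 0.0      | -2.353233 | 3.0      | 2.0      | 9.185412 |
| 75%   | 1.0      | -1.301007 | 4.0      | 4.0      | 9.88928  |
| max   | 1.0      | 2.337425  | 5.0      | 5.0      | 10.41018 |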


### .values

This returns the observations in the dataset as a NumPy array.

Syntax and Usage: `print(dat.values)`

Note: `.values` is an attribute, not a method, so no parentheses are needed.

In [8]:
# My code here
dat.values

array([['Argentina', 0, -0.7207754, 1, 3, 9.69017],
       ['Australia', 1, -6.907755, 5, 4, 10.30484],
       ['Austria', 1, -4.910337, 5, 4, 10.10094],
       ['Bangladesh', 0, 0.7759748, 1, 0, 8.379768],
       ['Belgium', 1, -4.617344, 5, 4, 10.25012],
       ['Bolivia', 0, -2.46144, 0, 0, 8.583543],
       ['Botswana', 1, -1.244868, 4, 3, 8.77771],
       ['Brazil', 1, -0.4570337, 4, 3, 9.375601],
       ['Burma', 0, 1.604343, 3, 1, 7.096721],
       ['Cameroon', 0, -4.229065, 3, 1, 8.120886],
       ['Canada', 1, -6.907755, 5, 5, 10.41018],
       ['Chile', 1, -1.542761, 3, 2, 9.261224],
       ['Colombia', 0, -2.057821, 3, 2, 9.191973],
       ['Congo-Kinshasa', 0, -2.323288, 1, 0, 7.095064],
       ['Costa Rica', 1, -5.090003, 3, 4, 9.167329],
       ["Cote d'Ivoire", 1, -4.229065, 4, 2, 8.228711],
       ['Denmark', 1, -6.907755, 5, 5, 10.10651],
       ['Dominican Republic', 0, -2.378862, 2, 2, 8.899731],
       ['Ecuador', 1, -1.845337, 3, 2, 9.117786],
       ['Finland', 1,

### .columns

This returns the names of the variables (columns) in the dataset.

Syntax and Usage: `print(dat.columns)`

Note: `.columns` is an attribute, not a method, so no parentheses are needed.

In [9]:
# My code here
dat.columns

Index(['country', 'courts', 'barb2', 'prsexp2', 'prscorr2', 'gdpw2'], dtype='object')

### .index

This returns the row labels (the index) of the dataset.

Syntax and Usage: `print(dat.index)`

Note: `.index` is an attribute, not a method, so no parentheses are needed.

In [10]:
# My code here
dat.index

RangeIndex(start=0, stop=62, step=1)
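To see how these attributes fit together, here is a minimal sketch on a made-up dataset (the name `toy` and its columns are hypothetical). It also uses `.to_numpy()`, which the pandas documentation recommends over `.values` for getting the data as a NumPy array.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"name": ["a", "b", "c"], "value": [10, 20, 30]})

n_rows, n_cols = toy.shape     # attribute: unpack the (rows, columns) tuple
col_names = list(toy.columns)  # convert the column Index to a plain list
row_labels = list(toy.index)   # convert the row index to a plain list
as_array = toy.to_numpy()      # the observations as a NumPy array

print(n_rows, n_cols)
print(col_names)
print(row_labels)
print(as_array)
```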

**Exercise**: Run the same examples for the dataset `dat2`.

In [11]:
## Your answers here!
dat2.describe()


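|       | obs       | totbill   | tip      | size     |
|-------|-----------|-----------|----------|----------|
| count | 244.0     | 244.0     | 244.0    | 244.0    |
| mean  | 122.5     | 19.785943 | 2.998279 | 2.569672 |
| std   | 70.580923 | 8.902412  | 1.383638 | 0.9511   |
| min   | 1.0       | 3.07      | 1.0      | 1.0      |
| 25%   | 61.75     | 13.3475   | 2.0      | 2.0      |
| 50%   | 122.5     | 17.795    | 2.9      | 2.0      |
| 75%   | 183.25    | 24.1275   | 3.5625   | 3.0      |
| max   | 244.0     | 50.81     | 10.0     | 6.0      |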


## Data Manipulation

### Subsetting variables (columns)

To subset variables, the syntax is simple. When it is only one variable:

```
dat["var_name"]
```

When it is two or more, you need to enclose them in a list:

```
dat[["var1", "var2"]]
```
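One detail worth knowing: single brackets return a pandas `Series`, while double brackets return a `DataFrame`, even when the list has only one name. A short sketch on a made-up dataset (`toy` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "var1": [1, 2, 3],
    "var2": [4.0, 5.0, 6.0],
    "var3": ["x", "y", "z"]})

one_col = toy["var1"]             # single brackets: a Series
one_col_df = toy[["var1"]]        # double brackets: a one-column DataFrame
two_cols = toy[["var1", "var2"]]  # double brackets: a two-column DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(two_cols.shape)
```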

In [27]:
# My code here
dat.head()
dat[['courts', 'barb2']].head()

|   | courts | barb2     |
|---|--------|-----------|
| 0 | 0      | -0.720775 |
| 1 | 1      | -6.907755 |
| 2 | 1      | -4.910337 |
| 3 | 0      | 0.775975  |
| 4 | 1      | -4.617344 |


### Subsetting cases (rows)

Now, to work with cases, notice that pandas allows us to do vectorized operations. For instance:

```
dat["var_name"] > some_number
```

This returns a boolean Series: `True` where the variable is greater than the number, and `False` otherwise. To subset the dataset, place that comparison inside the brackets:

```
dat[dat["var_name"] > some_number]
```
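A small sketch of the two steps, using made-up data (the names `toy`, `mask`, and `subset` are hypothetical). Storing the comparison in a variable first makes it easy to check how many rows will be kept.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"country": ["A", "B", "C", "D"], "score": [1, 5, 3, 4]})

# Step 1: the comparison gives one True/False per row
mask = toy["score"] > 3
print(mask.tolist())  # [False, True, False, True]
print(mask.sum())     # True counts as 1, so this is the number of rows kept: 2

# Step 2: use the True/False values to keep only the matching rows
subset = toy[mask]
print(subset)
```

Note that the subset keeps the original row labels (here, 1 and 3).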

For multiple comparisons, combine the conditions with `&` (and) or `|` (or), wrapping each condition in parentheses:

```
dat[ (dat["v1"] == "some_value") & (dat["v2"] == "some_other_value") ]
```

And if we want a command similar to `%in%` in R, we can use the `.isin(.)` method:

```
dat[ dat["v1"].isin(["some_value", "some_other_value"]) ]
```
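The sketch below runs these patterns on a made-up dataset (`toy` and its values are hypothetical), and adds `~`, which negates a condition.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"v1": ["a", "b", "c", "a"], "v2": [1, 2, 3, 4]})

# AND: both conditions must hold (each one in its own parentheses)
both = toy[(toy["v1"] == "a") & (toy["v2"] > 1)]

# OR: at least one condition must hold
either = toy[(toy["v1"] == "a") | (toy["v2"] == 2)]

# NOT: rows whose v1 is not in the list
not_in = toy[~toy["v1"].isin(["a", "b"])]

print(both.index.tolist())    # [3]
print(either.index.tolist())  # [0, 1, 3]
print(not_in.index.tolist())  # [2]
```

The parentheses matter because `&` and `|` bind more tightly than `==` and `>` in Python, so leaving them out usually raises an error.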

In [40]:
# My code here
dat[(dat['prsexp2'].isin([4, 5])) & (dat['prscorr2'].isin([4, 5]))]


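|    | country     | courts | barb2     | prsexp2 | prscorr2 | gdpw2    |
|----|-------------|--------|-----------|---------|----------|----------|
| 1  | Australia   | 1      | -6.907755 | 5       | 4        | 10.30484 |
| 2  | Austria     | 1      | -4.910337 | 5       | 4        | 10.10094 |
| 4  | Belgium     | 1      | -4.617344 | 5       | 4        | 10.25012 |
| 10 | Canada      | 1      | -6.907755 | 5       | 5        | 10.41018 |
| 16 | Denmark     | 1      | -6.907755 | 5       | 5        | 10.10651 |
| 19 | Finland     | 1      | -6.907755 | 5       | 5        | 10.12367 |
| 27 | Ireland     | 1      | -6.907755 | 5       | 4        | 9.891465 |
| 28 | Israel      | 0      | -2.319996 | 4       | 4        | 10.06777 |
| 30 | Japan       | 1      | -6.907755 | 5       | 4        | 9.892022 |
| 37 | New Zealand | 1      | -6.907755 | 5       | 5        | 10.17626 |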


**Exercise**: Filter the `tips` dataset (our `dat2`) by:

1. Bills of more than 10 dollars
2. Smokers
3. Weekend

Do each of these separately, then do all together.

In [14]:
## Your answers here!

### Simple computations

It is simple to create new variables from older ones.

```
# Summing two variables
dat["my_new_var"] = dat["my_old_var1"] + dat["my_old_var2"]

# Multiplying by a constant
dat["my_new_var"] = dat["my_old_var1"] * constant

# Apply some numpy function (try to always use numpy functions, as pandas is based on numpy)
import numpy as np
dat["my_new_logged_var"] = np.log(dat["my_old_var"])
```
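A runnable version of the patterns above, on a made-up dataset (the name `toy` and its columns are hypothetical). The last line undoes the log with `np.exp`, which is a quick way to check that the transformation did what we expected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 30.0]})

# Summing two variables, row by row
toy["total"] = toy["x1"] + toy["x2"]

# Multiplying by a constant
toy["x1_doubled"] = toy["x1"] * 2

# Applying a numpy function, then reversing it
toy["log_x2"] = np.log(toy["x2"])
toy["x2_back"] = np.exp(toy["log_x2"])

print(toy)
```

All of these operations are vectorized: they apply to every row at once, with no loop needed.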

In [45]:
# My code here
dat['gdppc'] = np.exp(dat['gdpw2'])
dat.head()

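|   | country    | courts | barb2     | prsexp2 | prscorr2 | gdpw2    | gdppc        |
|---|------------|--------|-----------|---------|----------|----------|--------------|
| 0 | Argentina  | 0      | -0.720775 | 1       | 3        | 9.69017  | 16157.990919 |
| 1 | Australia  | 1      | -6.907755 | 5       | 4        | 10.30484 | 29876.873543 |
| 2 | Austria    | 1      | -4.910337 | 5       | 4        | 10.10094 | 24365.902611 |
| 3 | Bangladesh | 0      | 0.775975  | 1       | 0        | 8.379768 | 4357.997753  |
| 4 | Belgium    | 1      | -4.617344 | 5       | 4        | 10.25012 | 28285.936029 |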


**Exercise**: In the `tips` dataset, create the variable `prop_tip`, which is the tip as a proportion of the total bill.

In [16]:
## Your answers here!

## Statistics

We can easily compute statistics from the data. Here are a few methods that we have available:

| Method           | Description                  |
|------------------|------------------------------|
| `.median()`      | Median                       |
| `.mean()`        | Mean                         |
| `.min()`         | Minimum                      |
| `.max()`         | Maximum                      |
| `.var()`         | Variance                     |
| `.std()`         | Standard Deviation           |
| `.sum()`         | Sum values                   |
| `.mode()`        | Most frequent value(s)       |
| `.quantile(val)` | Quantile (val from 0 to 1)   |
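A minimal sketch of these methods on a made-up series of five numbers (the name `scores` is hypothetical). Two details are easy to miss: `.var()` and `.std()` divide by n - 1 by default (the sample versions), and `.mode()` returns a Series, because several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.Series([1, 2, 2, 3, 7])

print(scores.mean())          # 3.0
print(scores.median())        # 2.0
print(scores.min())           # 1
print(scores.max())           # 7
print(scores.sum())           # 15
print(scores.var())           # 5.5 (sample variance, divides by n - 1)
print(scores.mode().tolist()) # [2]
print(scores.quantile(0.25))  # 2.0 (first quartile)
```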

In [52]:
# My code here
dat.head()
dat['gdppc'].sum()

767443.8833463824

**Exercise**: For the `tips` dataset:

1. Compute the mean and median of `tip`.
2. Compute the mode of `day`.
3. Compute the first quartile of `totbill`.

In [18]:
## Your answers here!

## Questions?

## Great job! See you next class!