# CSS 120 - Environmental Data Science

## Lecture 04

### Umberto Mignozzetti

## Today's Class

So far, we have studied the history of data science and why it is useful for the environmental sciences.

Now we are going to study how to do Exploratory Data Analysis in Python.

## Exploratory Data Analysis

1. **Understanding Data**: Exploratory Data Analysis (EDA) involves examining the data to understand its structure, distribution, and main characteristics, typically using statistical summaries and graphical representations.

2. **Identifying Patterns and Anomalies**: It focuses on detecting patterns, trends, relationships, and anomalies in the data, which can be crucial for formulating hypotheses and guiding further analysis.

3. **Preparing for Advanced Analysis**: EDA serves as a preliminary step before more complex statistical modeling or machine learning, ensuring that the data is well-understood and appropriately prepared for such analyses.

## Load Pandas

Loading pandas is easy. Provided that the package is installed (if not, see the [installation guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)), type:

In [None]:
# My code here
import pandas as pd

## Load Data into Python

To start having fun, we need to load data into Python. We can do this in three ways: from a local file, from the internet, or by typing the data on the keyboard.

### From a local file

First, we need to find the working directory, which is the folder where Python looks for files by default. The standard library module `os` can tell us:

```
import os
print(os.getcwd())
```

Then, place the file in that folder. If you would rather change the working directory to another folder, use:

```
os.chdir("new_path_here")
```

### From a local file

Now that we know the folder, and the file is there, we can load it:

```
dat = pd.read_csv('file_name_here.csv')
```

Here I will load CSV files, but pandas can also read other formats, such as Excel, SPSS, and Stata.
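
To practice the full round trip without needing a file on hand, the sketch below writes a small made-up dataset to a temporary folder and reads it back. The names `toy`, `toy_data.csv`, and the values are hypothetical; building the path with `os.path.join` is an alternative to changing the working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for a real file
toy = pd.DataFrame({
    "site": ["a", "b", "c"],
    "temp_c": [12.5, 14.1, 9.8]})

with tempfile.TemporaryDirectory() as folder:
    # Build the full path to the file inside the temporary folder
    path = os.path.join(folder, "toy_data.csv")
    toy.to_csv(path, index=False)   # index=False: do not write the row labels
    loaded = pd.read_csv(path)      # read the file back into a DataFrame

print(loaded)
```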

In [None]:
# My code here
import os

In [None]:
print(os.getcwd())

### From the internet

We can also load data directly from the internet.

For example, consider the following dataset: https://raw.githubusercontent.com/umbertomig/POLI30Dpublic/main/datasets/stokes_electoral_2015.csv.

To open it, we pass the URL to `read_csv`, just as we passed the file name in the local version.

### From typing on the keyboard

We can also build a dataset from scratch.

For example, we could build a simple dataset in the following way:

```
dat = pd.DataFrame({
    "v1": ['d1', 'd2', 'd3'],
    "v2": [1, 2, 3],
    "v3": ['A', 'B', 'A'],
    "v4": [2.0, 1.1, 2.2]})
```

This works well for small datasets, but typing every value by hand quickly becomes impractical for larger ones.

In [None]:
# Let's do it?!

## Dataset Information

Suppose we have a pandas dataset called `dat`. To make it more realistic, use the following example:

```
# For me: QOG-EI
dat = pd.read_csv('qog_ei2.csv')

# For you: NIMBY Wind Turbines
nimby = pd.read_csv('nimby.csv')
```

If you are having VPN issues, I put the data on Canvas, so you can download it and load from your own computer.

In [None]:
# My code here
dat = pd.read_csv('qog_ei2.csv')
dat.head()

In [None]:
# Your code here: Load (nimby.csv)

### .info(.)

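This method prints a summary of the dataset: the column names, the number of non-missing values in each column, the data types, and the memory usage.

Syntax and Usage: `dat.info()`

Note: `.info()` prints by itself and returns `None`, so there is no need to wrap it in `print()`.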

In [None]:
# My code here
dat.info()

### .head(.)

This method returns the first few observations of the dataset (five by default).

Syntax and Usage: `print(dat.head())`

In [None]:
# My code here
dat.head()

### .shape

This attribute holds the number of rows and columns of a dataset, as a tuple `(rows, columns)`.

Syntax and Usage: `print(dat.shape)`

Note: no parentheses are needed, because `.shape` is an attribute, not a method.

In [None]:
# My code here
print(dat.shape)

In [None]:
list(dat.columns)

### .describe(.)

This method gives us a few summary statistics of the dataset.

Syntax and Usage: `print(dat.describe())`

In [None]:
# My code here
dat.describe()

### .values

This attribute holds the observations in the dataset as a NumPy array, without the row and column labels.

Syntax and Usage: `print(dat.values)`

Note: no parentheses are needed.

In [None]:
# My code here
print(dat.values)

### .columns

This attribute holds the variable (column) names of the dataset.

Syntax and Usage: `print(dat.columns)`

Note: no parentheses are needed.

In [None]:
# My code here
print(dat.columns)

### .index

This attribute holds the row labels of the dataset.

Syntax and Usage: `print(dat.index)`

Note: no parentheses are needed.

In [None]:
# My code here
print(dat.index)
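
To see all of these tools side by side, here is a minimal sketch that applies them to the small typed dataset from earlier. The name `dat_small` is hypothetical and is used only so that this example does not overwrite `dat`.

```python
import pandas as pd

# Hypothetical small dataset (same values as the typed example above)
dat_small = pd.DataFrame({
    "v1": ["d1", "d2", "d3"],
    "v2": [1, 2, 3],
    "v3": ["A", "B", "A"],
    "v4": [2.0, 1.1, 2.2]})

dat_small.info()                   # prints the summary itself; returns None
print(dat_small.shape)             # (3, 4): three rows, four columns
print(list(dat_small.columns))     # variable names as a plain list
print(dat_small.index)             # row labels
print(dat_small.describe())        # by default, summarizes numeric columns only
```

Note that `.describe()` reports only `v2` and `v4` here, because `v1` and `v3` are text columns.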

**Exercise**: Run the same examples for the dataset `nimby`.

In [None]:
## Your answers here!

## Data Manipulation

### Subsetting variables (columns)

To subset variables, the syntax is simple. When it is only one variable:

```
dat["var_name"]
```

When it is two or more, you need to enclose them in a list:

```
dat[["var1", "var2"]]
```
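
The two forms return different kinds of objects, which matters for later steps. A minimal sketch on a made-up dataset (the name `toy` and its values are hypothetical, not taken from the course data):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cname": ["Brazil", "Canada", "Finland"],
    "temp": [25.0, -5.0, 2.0],
    "rain": [150.0, 60.0, 50.0]})

one = toy["temp"]               # single name -> a Series
two = toy[["cname", "temp"]]    # list of names -> a DataFrame
also_df = toy[["temp"]]         # one-element list -> still a DataFrame

print(type(one).__name__, type(two).__name__, type(also_df).__name__)
```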

In [None]:
# My code here
dat.head()

In [None]:
dat["emdat_ndis"]

In [None]:
var = dat["emdat_ndis"]
print(var)

In [None]:
dat.head()

In [None]:
dat[['cname', 'cckp_rain', 'cckp_temp']]

### Subsetting cases (rows)

To work with cases, notice that pandas lets us do vectorized operations: a comparison is applied to every row at once. For instance:

```
dat["var_name"] > some_number
```

This returns a Boolean Series with `True` where the variable is greater than the number and `False` otherwise. To subset the dataset, place that comparison inside the brackets:

```
dat[dat["var_name"] > some_number]
```

And that's it! For multiple comparisons, wrap each comparison in parentheses and combine them with `&` (and) or `|` (or):

```
dat[ (dat["v1"] == "some_value") & (dat["v2"] == "some_other_value") ]
```

And if we want a command similar to `%in%` in R, we can use the `.isin(.)` method:

```
dat[ dat["v1"].isin(["some_value", "some_other_value"]) ]
```
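
The row filters above can be sketched on a made-up dataset as follows. The name `toy` and its values are hypothetical; `|` (or) and `~` (not) are standard pandas operators that work the same way as `&`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cname": ["Brazil", "Canada", "Canada", "Finland"],
    "year": [2014, 2015, 2016, 2017]})

mask = toy["year"] > 2015          # Boolean Series, one value per row
recent = toy[mask]                 # keeps only the rows where mask is True

# Each comparison needs its own parentheses
recent_canada = toy[(toy["year"] > 2015) & (toy["cname"] == "Canada")]
either = toy[(toy["year"] > 2016) | (toy["cname"] == "Brazil")]

# ~ negates a mask: everything that is NOT in the list
not_canada = toy[~toy["cname"].isin(["Canada"])]

print(recent_canada)
```

Python's `and`, `or`, and `not` keywords do not work element by element on a Series, which is why the symbols are used instead.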

In [None]:
# My code here
dat.head()

In [None]:
dat['year'] > 2015

In [None]:
dat[dat['year'] > 2015]

In [None]:
print(dat.head())
(dat['year'] > 2015) & (dat['cname'] == "Canada")

In [None]:
dat[(dat['year'] > 2015) & (dat['cname'] == "Canada")]

In [None]:
dat['cname'].isin(['Brazil', 'Canada', 'Finland'])

In [None]:
dat[dat['cname'].isin(['Brazil', 'Canada', 'Finland'])]

### Simple computations

It is simple to create new variables from older ones.

```
# Summing two variables
dat["my_new_var"] = dat["my_old_var1"] + dat["my_old_var2"]

# Multiplying by a constant
dat["my_new_var"] = dat["my_old_var1"] * constant

# Apply some numpy function (try to always use numpy functions, as pandas is based on numpy)
import numpy as np
dat["my_new_logged_var"] = np.log(dat["my_old_var"])
```
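
As a runnable sketch of the same idea, the example below creates two new variables on made-up data. The name `toy`, its columns, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "temp_c": [0.0, 10.0, 25.0],
    "rain_mm": [1.0, 10.0, 100.0]})

# Arithmetic is applied to every row at once
toy["temp_f"] = 1.8 * toy["temp_c"] + 32.0

# NumPy functions also work element by element on a column
toy["log10_rain"] = np.log10(toy["rain_mm"])

print(toy)
```

Keep in mind that `np.log` is the natural logarithm (use `np.log10` for base 10), and that the logarithm of zero is `-inf`, so check for zeros before logging a variable.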

In [None]:
# My code here
dat.head()

In [None]:
dat['cckp_temp_fahr'] = 1.8 * dat['cckp_temp'] + 32.0
dat.head()

## Statistics

We can easily compute statistics from the data. Here are a few methods that we have available:

| Method           | Description                  |
|------------------|------------------------------|
| `.median()`      | Median                       |
| `.mean()`        | Mean                         |
| `.min()`         | Minimum                      |
| `.max()`         | Maximum                      |
| `.var()`         | Variance                     |
| `.std()`         | Standard Deviation           |
| `.sum()`         | Sum values                   |
| `.mode()`        | Most frequent value(s)       |
| `.quantile(val)` | Quantile (val between 0 and 1) |
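
A minimal sketch of these methods on a made-up Series (the name `x` and its values are hypothetical). One detail worth knowing: pandas computes `.var()` and `.std()` with the sample formula (dividing by n - 1) by default; pass `ddof=0` for the population version.

```python
import pandas as pd

# Hypothetical values with one large observation
x = pd.Series([1, 2, 2, 3, 10])

print(x.mean())            # pulled upward by the large value
print(x.median())          # less sensitive to the large value
print(x.mode())            # returns a Series, since ties are possible
print(x.quantile(0.25))    # first quartile
print(x.var())             # sample variance (divides by n - 1)
print(x.var(ddof=0))       # population variance (divides by n)
```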

In [None]:
# My code here
dat.head()

In [None]:
dat['cckp_temp'].mean()

In [None]:
dat['cckp_temp'].median()

In [None]:
dat['emdat_ndis'].mode()

In [None]:
dat['cckp_temp'].quantile(0.25)

In [None]:
dat['cckp_temp'].quantile(0.90)

In [None]:
dat['cckp_temp'].quantile(0.10)

## Counting

### Counting data

To count how many times each value of a variable appears, use:

```
dat["variable"].value_counts()
```

The counts are sorted from most to least frequent by default. We can make that explicit (or pass `sort = False` to turn it off):

```
dat["variable"].value_counts(sort = True)
```

We can also count proportions:

```
dat["variable"].value_counts(normalize = True)
```
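
The three variants can be compared on a made-up Series (the name `s` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical categorical data
s = pd.Series(["A", "B", "A", "C", "A", "B"])

counts = s.value_counts()                  # sorted from most to least frequent
shares = s.value_counts(normalize=True)    # proportions that sum to 1
unsorted = s.value_counts(sort=False)      # same counts, not sorted by frequency

print(counts)
print(shares)
```

By default `.value_counts()` leaves missing values out of the count; passing `dropna=False` includes them.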

In [None]:
# My code here
dat['ross_gas_exp'].value_counts(sort = False, normalize = True)

In [None]:
dat['ross_gas_exp'].value_counts(sort = False)

In [None]:
dat['ross_gas_exp'].value_counts()

### Detecting missing data

We can also detect missing data using the method:

```
dat.isna()
```

And, if we want, we can count the missing values in each variable:

```
dat.isna().sum()
```

Finally, to remove every row that has at least one missing value, use:

```
dat.dropna()
```

Or we can fill the missing values with a custom value (proceed with caution here!):

```
dat.fillna(0)
```
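
A minimal sketch on made-up data shows both why caution is needed and that these methods return a new dataset rather than changing the original. The name `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
toy = pd.DataFrame({
    "temp": [20.0, np.nan, 15.0],
    "rain": [np.nan, np.nan, 3.0]})

n_missing = toy.isna().sum()   # missing values per column
complete = toy.dropna()        # rows with no missing values at all
filled = toy.fillna(0)         # missing values replaced by 0

# Neither call modified toy; assign the result to keep it
print(n_missing)
print(toy["temp"].mean())      # skips the missing value
print(filled["temp"].mean())   # the filled-in 0 drags the mean down
```

Here only one row survives `.dropna()`, and filling with zero changes the mean temperature, so the choice between dropping and filling can affect every later result.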

In [None]:
dat.isna()

In [None]:
dat

In [None]:
dat.isna().sum()

## Plots

Now, let's create some plots!

The library we will use to create plots is `matplotlib`. We can import it in Python with:

```
from matplotlib import pyplot as plt
```

In [None]:
# My code here
from matplotlib import pyplot as plt

In [None]:
dat.head()

### Histogram

We can make a simple histogram using the method `.hist()`:

```
dat['variable'].hist()
plt.show()
```

And if we want overlapping histograms by a category:

```
dat[dat['vcat'] == 'v1']['variable'].hist()
dat[dat['vcat'] == 'v2']['variable'].hist()
plt.legend(["v1", "v2"])
plt.show()
```
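
When histograms overlap, the second one can hide the first. One common remedy, sketched below on made-up data, is to make the bars semi-transparent with `alpha` and to label each group. This sketch calls matplotlib's `ax.hist` directly instead of the pandas `.hist()` method; the name `toy`, its values, and the `Agg` backend line are assumptions made so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend; drop this line in a notebook
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical toy data with two categories
toy = pd.DataFrame({
    "vcat": ["v1"] * 5 + ["v2"] * 5,
    "variable": [1.0, 2.0, 2.5, 3.0, 3.5, 3.0, 4.0, 4.5, 5.0, 6.0]})

fig, ax = plt.subplots()
for level in ["v1", "v2"]:
    values = toy.loc[toy["vcat"] == level, "variable"]
    # alpha < 1 keeps both groups visible where they overlap
    ax.hist(values, bins=5, alpha=0.5, label=level)
ax.legend()
ax.set_xlabel("variable")
```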

Let's try it!

In [None]:
# My code here
dat.cckp_rain.hist()

In [None]:
dat.cckp_temp.hist()

### Scatterplot

To make a scatterplot, use:

```
plt.scatter(dat.vx, dat.vy)
plt.show()
```

If we want to add axis labels and a title:

```
plt.scatter(dat.vx, dat.vy)
plt.xlabel("X-axis name")
plt.ylabel("Y-axis name")
plt.title("Plot title")
plt.show()
```
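
The same plot can be sketched on made-up data as follows. This version uses the `fig, ax` interface, where `ax.set_xlabel` plays the role of `plt.xlabel`; the name `toy`, its values, the label text, and the `Agg` backend line are assumptions made so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend; drop this line in a notebook
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "rain": [50.0, 60.0, 150.0, 200.0],
    "temp": [2.0, -5.0, 25.0, 27.0]})

fig, ax = plt.subplots()
ax.scatter(toy["rain"], toy["temp"])
ax.set_xlabel("Rainfall (toy units)")
ax.set_ylabel("Temperature (toy units)")
ax.set_title("Temperature vs. rainfall")
```

In a notebook, the figure appears below the cell; in a script, call `plt.show()` at the end.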

In [None]:
plt.scatter(dat.cckp_rain, dat.cckp_temp)


## Great work!