## POLSCI 3 Lecture Notebook 1: Descriptive Statistics

In this notebook we will learn new tools for examining the properties of our data. What are the usual values that occur? Which data are outliers? What do the data "look like"? We will begin by reading in the electoral data from last lecture.

In [None]:
# This stores, or assigns, the dataset as election.data
election.data <- read.csv('FairFPSR3.csv')

Here is the **codebook** that tells you what each variable means.

`inc_vote`: % of major party presidential vote won by incumbent party

`year`: Year of the presidential election

`inflation`: Inflation rate

`goodnews`: Number of quarters, out of the first 15 quarters of the administration, in which economic growth exceeded 3.2%

`growth`: % change in real GDP per capita

Sometimes we just want to look at the values of individual variables (columns) in our data. To do that, we need to refer to individual variables in the dataset. Recall from last time that we refer to an individual variable as `dataSetName$varName`. To list the values of the `inc_vote` variable:

In [None]:
election.data$inc_vote

Something we often want to know is the average value of a variable. The average makes the most sense for data like `inc_vote`, `growth`, or `inflation`, which are "continuous variables" whose values can meaningfully be added to and subtracted from one another.

For a variable $X$ with $n$ observations $X_1, \dots, X_n$, its mean, $\overline{X}$, is given by:

$$\overline{X} = \frac{\sum_{i=1}^{n}X_i}{n}$$

To take the mean (average) in R we use `mean(dataSetName$varName)`:


In [None]:
# This computes the mean (average) of the inc_vote column of the data
mean(election.data$inc_vote)
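
As a quick check on the formula above, the sketch below computes a mean by hand and compares it with `mean()`. The vector `toy.votes` is made up for illustration and is not taken from `FairFPSR3.csv`.

```r
# Hypothetical vote shares, invented for illustration (not from FairFPSR3.csv)
toy.votes <- c(52, 48, 55, 45)

# Apply the formula directly: add up the values, then divide by how many there are
manual.mean <- sum(toy.votes) / length(toy.votes)

# The built-in function returns the same number
builtin.mean <- mean(toy.votes)

manual.mean
builtin.mean
```

Both lines should print the same value, since `mean()` is doing the same sum-and-divide calculation for us.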

We can assign the mean to a new variable, `mean.inc_vote`:

In [None]:
mean.inc_vote <- mean(election.data$inc_vote)

When we have categorical variables with values such as `male` and `female` (variable name `sex`) or `Catholic` and `Protestant` (variable name `religion`) it doesn't make sense to compute an average. Instead we usually want to know the number or proportion of observations in each category. To compute this we need the `table` command: `table(dataSetName$varName)`.

To see how this works we are going to read in a new dataset where the observations are countries. 
Let's read in those data:

In [None]:
happiness_data <- read.csv('happiness_polity_2018.csv')
head(happiness_data)

This dataset contains data on countries around the world, covering happiness levels, 
political categories, and demographic information. Here's a codebook:

<code>polity2</code>: The "Polity Score" of the country, which measures its political system on a 21-point scale 
ranging from -10 (hereditary monarchy) to +10 (consolidated democracy).

<code>polity2_cat</code>: The political category that the country falls into. "autocracies" are at the bottom of 
the spectrum, "anocracies" (semi-democracies) are in the middle, and "democracies" are at the top.

<code>gdpcapita</code>: GDP per Capita (economic output per person)

<code>gdpcapita_cat</code>: GDP/income category that the country falls into (based on GDP per capita)

<code>happiness</code>: The country's happiness index, measured through surveys that ask participants to rate 
their level of happiness based on an assortment of quality-of-life factors

<code>happiness_cat</code>: Happiness category that the country falls into (based on the happiness index)

<code>life_expectancy</code>: Average life expectancy in years

<code>life_expectancy_cat</code>: Life Expectancy category that the country falls into



## One-Way Tables

Suppose we want to see what the most common global income category is. We will use the `table()` function, passing it the column we want to summarize.

In [None]:
table(happiness_data$gdpcapita_cat)

Suppose we want to know how many democracies, anocracies, and autocracies there are:

In [None]:
table(happiness_data$polity2_cat)

This example also shows why categorical variables can be useful. A table of all the values of `gdpcapita` is not very informative, because a continuous variable rarely repeats a value, so most values are likely to appear only once:

In [None]:
table(happiness_data$gdpcapita)

Tables are one of the easiest ways to understand categorical variables in a dataset. For example, we learned above that middle-income countries form the majority in the dataset. Another thing we might want to know is what *proportion* of values are in each category. For that we use the `prop.table` command:

In [None]:
prop.table(table(happiness_data$gdpcapita_cat))
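
To see what `table` and `prop.table` are doing, here is a minimal sketch on a small made-up vector. The name `toy.income` and its values are hypothetical and are not drawn from `happiness_polity_2018.csv`.

```r
# Hypothetical income categories for six countries (invented for illustration)
toy.income <- c("low", "middle", "middle", "high", "middle", "low")

# Count how many observations fall in each category
toy.counts <- table(toy.income)

# Convert the counts to proportions of the total
toy.props <- prop.table(toy.counts)

# Dividing each count by the total count gives the same proportions
toy.manual <- toy.counts / sum(toy.counts)

toy.counts
toy.props
```

The proportions always add up to 1, which makes them easier to compare across datasets of different sizes than raw counts.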

## Two-Way Tables

Two-way tables allow us to see how the values of one variable correspond with the values of another. For example, instead of just looking at political categories, we can look at political categories among countries of each income level in one table.

To do this, we use the same `table()` command, but give it *two* variables, separated by a comma. The first variable goes along the rows, and the second goes along the columns.

Let's group GDP categories with political categories to see how many countries fit into each of the groupings:

In [None]:
table(happiness_data$polity2_cat, happiness_data$gdpcapita_cat)
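
The row-and-column layout is easier to see on a tiny example. The sketch below uses two made-up vectors, `toy.regime` and `toy.income`, which are hypothetical and not part of the course data.

```r
# Hypothetical regime and income categories for five countries (invented for illustration)
toy.regime <- c("democracy", "democracy", "autocracy", "democracy", "autocracy")
toy.income <- c("high", "middle", "low", "high", "middle")

# The first argument defines the rows, the second defines the columns
toy.two.way <- table(toy.regime, toy.income)
toy.two.way

# Look up a single cell by row name, then column name
toy.two.way["democracy", "high"]
```

Each cell counts the observations that have both the row's value and the column's value, so the cells add up to the total number of observations.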

Let's use `prop.table` to express each cell as a proportion of all the countries in the table, which makes the relationship between regime type and income easier to compare:

In [None]:
prop.table(table(happiness_data$polity2_cat, happiness_data$gdpcapita_cat))
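
By default, `prop.table` divides each cell by the grand total. If the question is instead something like "what share of democracies are high income?", proportions within each row may be easier to read. The sketch below shows the `margin` argument of `prop.table` on made-up data; `toy.regime` and `toy.income` are hypothetical vectors, not the course dataset.

```r
# Hypothetical regime and income categories for six countries (invented for illustration)
toy.regime <- c("democracy", "democracy", "autocracy", "democracy", "autocracy", "democracy")
toy.income <- c("high", "middle", "low", "high", "middle", "low")
toy.two.way <- table(toy.regime, toy.income)

# Default: each cell as a share of all observations (all cells sum to 1)
overall.props <- prop.table(toy.two.way)

# margin = 1: each cell as a share of its row (each row sums to 1)
row.props <- prop.table(toy.two.way, margin = 1)

# margin = 2: each cell as a share of its column (each column sums to 1)
col.props <- prop.table(toy.two.way, margin = 2)

overall.props
row.props
col.props
```

Which version to use depends on the comparison you want to make: row proportions compare income levels within each regime type, while column proportions compare regime types within each income level.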

## Review

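We learned three new R functions today, `mean`, `table`, and `prop.table`, used in the following ways:

- `mean(dataset$variable)`
- `table(dataset$variable)`
- `table(dataset$first.variable, dataset$second.variable)`
- `prop.table(table(dataset$variable))`
- `prop.table(table(dataset$first.variable, dataset$second.variable))`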


#### Reminder about Peer Consulting Office Hours

If you had trouble with any content in this notebook, Data Peer Consultants are here to help! You 
can view their locations and availabilities at this link: https://data.berkeley.edu/degrees/peer-advising.
Peer Consultants are there to answer all data-related questions, whether about the content of this notebook,
applications of data science in the world, or other data science courses offered at Berkeley. 
Make sure to take advantage of this wonderful resource!