# POLSCI 3

## Week 2, Lecture 2: Data summaries for continuous variables

In the first Notebook Lecture in class this week, we talked about using one-way and two-way tables to summarize data.

That was a useful way to handle summaries of **categorical variables**---those that take on a limited number of values (i.e., categories).  

But we saw that was not such a useful approach for **continuous variables**, or those that can take on many values.  

One possibility we discussed is collapsing the values of continuous variables into several categories. For instance, we can use GDP per capita to describe countries as "low income," "middle income," or "rich" and then use tables to summarize the numbers of countries in each category. 
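
Here is a minimal sketch of that collapsing step using R's `cut()` function. The GDP values and the cutoffs (5,000 and 30,000) are made up for illustration; they are not the ones used to build <code>gdpcapita_cat</code> in our dataset.

```r
# Hypothetical GDP per capita values (made-up numbers, not from the happiness dataset)
gdp <- c(1500, 4200, 9800, 15000, 31000, 47000, 62000, 85000)

# Assumed cutoffs for illustration only: under 5,000 is "low income",
# 5,000 up to (but not including) 30,000 is "middle income", 30,000 or more is "rich"
gdp_cat <- cut(gdp,
               breaks = c(0, 5000, 30000, Inf),
               labels = c("low income", "middle income", "rich"),
               right = FALSE)

# One-way table of the number of countries in each category
table(gdp_cat)
```

Notice that once the values are collapsed, a country at 31,000 and a country at 85,000 look identical: both are just "rich."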

Often, however, it is useful to retain the richer information included in continuous variables. 

In this Lecture, you'll learn two techniques in R to make one-way and two-way summaries for continuous variables, or those with many values. 
For one-way (one-dimensional) summaries we'll use **histograms**.  And for two-way (two-dimensional) summaries we'll use **scatterplots**.


### Loading the data:

As always, we can start by loading in our data. We are going to continue to work with the happiness dataset from the previous lecture.

In [None]:
happiness_data <- read.csv('happiness_polity_2018.csv')
head(happiness_data)

Again, here's the codebook:

- <code>polity2</code>: The "Polity Score" of the country, which measures its political system on a 21-point scale ranging from -10 (hereditary monarchy) to +10 (consolidated democracy).
- <code>polity2_cat</code>: The political category the country falls into. "autocracies" are at the bottom of the spectrum, "anocracies" (semi-democracies) are in the middle, and "democracies" are at the top.
- <code>gdpcapita</code>: GDP per capita (economic output per person)
- <code>gdpcapita_cat</code>: GDP/income category that the country falls into (based on GDP per capita)
- <code>happiness</code>: The country's happiness index, measured through surveys that require participants to rank their level of happiness based on an assortment of quality-of-life factors
- <code>happiness_cat</code>: Happiness category that the country falls into (based on the happiness index)
- <code>life_expectancy</code>: Average life expectancy in years
- <code>life_expectancy_cat</code>: Life Expectancy category that the country falls into

## ***PART 1: HISTOGRAMS***

Now, what if we want to know how GDP per capita varies across countries in this data set?  We could simply print the values:

In [None]:
# remember, to extract the value of a variable, we use the general formula dataset_name$variable_name.  Try it by running this cell:
happiness_data$gdpcapita

That list of values is pretty nasty to look at!  And not very informative...

So let's try to look at these values graphically.  

We will use a special kind of bar chart where each bar represents the **frequency** of values of GDP per capita that fall within a given range.  

To create such a chart in R, you can use the command "hist()", where hist stands for histogram.  

Inside the parentheses goes the name of the variable for which you want to plot the distribution of values.  

Let's try it! 

In [None]:
# Let's plot the histogram for the GDP per capita variable
# Inside the parentheses, you need to use dataset_name$variable_name to extract the values of variable_name from dataset_name
# For GDP per capita in the happiness data we have
hist(happiness_data$gdpcapita)


Most of the countries in the dataset are pretty poor.  The per capita income (GDP) in over 80 of them is under $20,000 a year. (Later, we'll talk about how to depict the results in terms of percentages rather than raw numbers, or "absolute frequencies").
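
If you want exact counts rather than reading bar heights off the plot, one option is to ask `hist()` for its underlying numbers. The sketch below uses a made-up vector of GDP values (not our dataset) so that it runs on its own; with the real data you would put `happiness_data$gdpcapita` in place of `gdp`.

```r
# Hypothetical GDP per capita values (made-up numbers for illustration)
gdp <- c(1500, 4200, 9800, 15000, 18000, 31000, 47000, 62000, 85000, 110000)

# plot = FALSE returns the histogram's underlying numbers instead of drawing it
h <- hist(gdp, plot = FALSE)

h$breaks  # the edges of the bins R chose
h$counts  # the number of values falling in each bin

# Count directly how many values are under 20,000
sum(gdp < 20000)
```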

A short diversion.  You might be curious what countries are up there above 80,000.  

You can look into that using the subset command you already learned!

In [None]:
richest <- subset(happiness_data, gdpcapita>80000)
richest$countryname

Wow, Ireland, Luxembourg and Singapore! These countries have high-value economies but small populations---and thus, high GDP per capita. (Remember, "per capita" means "per person.")

Where does the US fit in this distribution?  You can use subset again to look at this!

In [None]:
us <- subset(happiness_data, countryname=="United States")
us$gdpcapita

The GDP per capita is just above $60,000. (This was the GDP per capita in the United States in 2018 -- it has increased since then).
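
Another way to say where one country fits in the distribution is to compute the share of countries that fall below it. Here is a rough sketch on a tiny made-up data frame (the country names and values are hypothetical, chosen only so the example runs by itself):

```r
# Hypothetical data frame (made-up countries and values, not the happiness dataset)
toy <- data.frame(
  countryname = c("A", "B", "C", "D", "E"),
  gdpcapita   = c(2000, 8000, 25000, 60000, 90000)
)

# Extract the value for one country using subset(), as in the cell above
d_value <- subset(toy, countryname == "D")$gdpcapita

# Proportion of countries with a lower GDP per capita than country D
# (toy$gdpcapita < d_value is TRUE/FALSE for each row; mean() gives the share of TRUEs)
mean(toy$gdpcapita < d_value)
```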

Three further tips about histograms for now.  

First, note that by default R groups countries into bins. (The graph might not be as informative about the distribution of GDP per capita if every country were its own bin).  You can change the number of bins by using the "breaks" option in "hist" (R treats the number you give as a suggestion and may adjust it so that the bin edges fall on round values):

In [None]:
hist(happiness_data$gdpcapita, breaks=5) # Now there are more countries in each bin

Second, we can give the histogram a snappier title and x-axis label:

In [None]:
hist(happiness_data$gdpcapita, breaks=5, main="Distribution of GDP per capita across countries", xlab="GDP per capita")

Finally, these histograms show the "absolute frequency" -- i.e., the raw numbers -- of countries in each bin.

But sometimes, it is useful to know the percentage or proportion -- the "relative frequency" -- of countries in each bin.

An easy way to do this in R is to use a slightly different function. We can load a library called "lattice" and then use "histogram()" rather than "hist()".  By default, this function shows percentages rather than raw numbers. The result is in color to boot!

In [None]:
library(lattice)
histogram(happiness_data$gdpcapita, breaks=5, main="Distribution of GDP per capita across countries", xlab="GDP per capita")

Note we can see now that over 60% of the countries have a per capita income under $20,000.
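
If you'd like the percentages as numbers rather than as a picture, you can also compute them from the bin counts in base R. This is a sketch on made-up values, not the lecture's data; `pct` is just a name chosen here for illustration.

```r
# Hypothetical GDP per capita values (made-up numbers for illustration)
gdp <- c(1500, 4200, 9800, 15000, 18000, 31000, 47000, 62000, 85000, 110000)

# Get the bin edges and counts without drawing the plot.
# breaks = 5 is only a suggestion: R picks round bin edges near that number of bins.
h <- hist(gdp, breaks = 5, plot = FALSE)

# Convert counts (absolute frequencies) to percentages (relative frequencies)
pct <- 100 * h$counts / sum(h$counts)

h$breaks  # bin edges
pct       # percent of values in each bin
```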

## ***PART 2: SCATTERPLOTS***

Now, suppose we want to know whether people in richer countries are happier.  

We could look at this question descriptively using categorical variables and tables, as we did in this week's first Notebook Lecture in class. For example, we could create a two-way table using the variables <code>gdpcapita_cat</code> and <code>happiness_cat</code>, both of which are categorical variables.  

However, sometimes it is useful to show the relationship between two continuous variables.  An easy way to do this is with scatterplots. 

In [None]:
# Try running the code in this cell to create a scatterplot

plot(happiness_data$gdpcapita, happiness_data$happiness)

It's a nice plot, but there are some things we could do to make it visually more attractive.

In [None]:

options(scipen = 999) # This option prevents the use of scientific notation like e+04 in the x-axis

# And let's add a snappier title and labels; as with the histogram, we use the options main = ... , xlab = ... , ylab = ...

plot(happiness_data$gdpcapita, happiness_data$happiness, main="Happiness and GDP per capita across countries", xlab = "GDP per capita", ylab = "Happiness") 

# Tip: if you can't see the full syntax in the line above, scroll to the right (on a Mac, swipe with two fingers together)

Ah that's better!
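
As an aside, `plot()` also accepts a formula of the form `y ~ x` together with a `data` argument, which saves you from typing the dataset name twice. Here is a small sketch using made-up values (not our dataset) so that it runs by itself:

```r
# Hypothetical data (made-up values, not the happiness dataset)
toy <- data.frame(
  gdpcapita = c(2000, 8000, 25000, 60000, 90000),
  happiness = c(4.1, 5.0, 5.9, 6.8, 7.0)
)

# Formula interface: y ~ x, so happiness goes on the vertical axis
# and gdpcapita goes on the horizontal axis
plot(happiness ~ gdpcapita, data = toy,
     main = "Happiness and GDP per capita (toy data)",
     xlab = "GDP per capita", ylab = "Happiness")
```

Either style gives the same kind of scatterplot. Just be careful about the order: with two separate arguments the x variable comes first, while in a formula the y variable comes first.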

So high GDP per capita makes people happier, right?  

Not so fast!  We'll be talking more about cause and effect (and why correlation does not imply causation) in Weeks 4-5 and beyond...

But even if this analysis may not teach us much about causal effects, we've learned some really useful tools for looking at continuous variables graphically.  You will practice these skills, along with the creation of tables, in the in-class assignment on Thursday!