# Dataframes with Pandas

Dataframes are the common format for social science data, whether you are using SAS, Excel, R, SPSS, Python, or any other analytical software. Imagine a dataframe as a table of values where, generally speaking, each column is a variable and each row is a unit of observation. In this chapter, you will become familiar with using dataframes by creating some imaginary data before moving to real datasets from the American Community Survey.

## Vectors

The smallest bit of data is a single observation, and it is not very useful on its own: _"Person X's height is 177 centimeters."_ Multiple observations like this, however, are more useful. If a medical researcher recorded the height of twenty patients, we would have one variable's worth of data. If this researcher recorded those observations in a spreadsheet, a script of code, or even on paper, as a series of consecutive values, they would have a vector of data. A vector is a one-dimensional series of values; it has one dimension in the sense that it records a single variable across $x$ observations. So our example vector with twenty people's recorded heights has the shape $1 \times 20$.

Be aware that the terms used to describe a vector change across programming languages and libraries. In base Python, the built-in one-dimensional sequence of data points is called a __list__, while numerical libraries such as NumPy call theirs an __array__. Although the concept and shape of the data are essentially the same, these forms have their own distinct uses and quirks. In this book, we will use the terms vector and array interchangeably.

In the cell below, you can create an array for the recorded heights of twenty patients. Notice that the code assigns a series of values for height in centimeters, each separated by a comma, to an object we decided to call `patient_height`. Note also that the values are enclosed in square brackets `[ ... ]`. Try copying the code into a new cell, replacing the square brackets with curly brackets `{ ... }`, and see how the cell's output changes. The output differs because Python builds an ordered list (our array) with square brackets `[]`, and an unordered __set__, which also discards duplicate values, with curly brackets `{}`.

In [1]:
# This code cell will be in every one of our chapters in Jupyter Notebook
# The function allows you to see every line of output when the code has multiple lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
patient_height = [177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173]
patient_height

[177,
 174,
 170,
 183,
 168,
 182,
 163,
 191,
 177,
 176,
 173,
 186,
 174,
 168,
 184,
 170,
 170,
 192,
 181,
 173]

*All of these 'observations' for height were generated at random with imaginary averages. At the end of the chapter we provide code you can use to generate random values around a mean.*
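To make the difference between square and curly brackets concrete, here is a minimal sketch using four made-up heights (the variable names `heights_list` and `heights_set` are illustrative, not part of the chapter's data). It relies only on built-in Python behavior: a list keeps every value in order, while a set drops repeated values and has no positions.

```python
# Four toy heights; 177 appears twice on purpose.
heights_list = [177, 174, 170, 177]   # square brackets: an ordered list
heights_set = {177, 174, 170, 177}    # curly brackets: an unordered set

# The list keeps all four values; the set keeps only the three distinct ones.
print(type(heights_list).__name__, len(heights_list))
print(type(heights_set).__name__, len(heights_set))

# Lists support lookup by position; sets do not.
print(heights_list[0])
```

Because a set discards duplicates and order, it is usually the wrong container for observations, where two patients can legitimately share the same height.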

## Pandas Series

If you think about it, a single vector of observations for patient height is also not that useful. Researchers are probably interested in multiple measurements like weight, daily calories, minutes of physical activity, etc. You can create these individually with the `pd.Series()` function in pandas. Yes, this is yet another name for a vector or an array, but its one-dimensional shape and its usefulness are still the same.

In [3]:
import pandas as pd

patient_height = pd.Series([177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173])

patient_weight = pd.Series([99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 93.53, 97.3, 51.95, 109.97, 89.79, 85.17])

weekly_activity = pd.Series([153, 541, 373, 246, 312, 123, 313, 295, 139, 328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121])

daily_calories = pd.Series([2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 2110, 2421, 2522, 1815])

print(patient_weight)

0      99.38
1      69.31
2      58.40
3      75.42
4      86.91
5      66.39
6      74.57
7      87.31
8      73.10
9      82.97
10     85.89
11     79.16
12     59.82
13     78.90
14     93.53
15     97.30
16     51.95
17    109.97
18     89.79
19     85.17
dtype: float64


Now our vectors print with a column of numbers from 0 to 19 on the left. These numbers are the index, and they capture the very important aspect of observation order: the position of each observation matches its position in all the other vectors that measure the same patients. Pandas also tells us that the vector contains `float64` values, which simply means 64-bit floating-point numbers, that is, numbers that can carry a decimal part. With a collection of vectors all measuring the same twenty patients in the same order, we can join these together into a two-dimensional table called a dataframe.
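The index and the data type can also be inspected directly. The sketch below uses two three-value toy Series (shortened from the chapter's data) and assumes a default pandas installation, where whole numbers are normally stored as a 64-bit integer type.

```python
import pandas as pd

weights = pd.Series([99.38, 69.31, 58.4])   # values with decimals
heights = pd.Series([177, 174, 170])        # whole numbers

# The index is the column of positions printed on the left; it starts at 0.
print(list(weights.index))

# dtype reports how the values are stored.
print(weights.dtype)   # floating-point storage for decimals
print(heights.dtype)   # an integer dtype for whole numbers

# Look up a single observation by its index label.
print(weights[2])
```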

## Pandas Dataframes

To put it very simply, aggregating two or more vectors makes a dataframe, with some important caveats for data quality. First, vectors must be the same length or else the number of observations wouldn't match. Second, we need to be sure that the data from different vectors correspond to the same unit of observation. E.g.: That height<sub>i</sub> and weight<sub>i</sub> are both measurements of the same person. 
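The first caveat is easy to see on a toy example. In this sketch (made-up values, with one weight deliberately missing), pandas does not refuse to combine Series of different lengths with `pd.concat()`, the function the chapter introduces next. Instead it fills the gap with a missing value, which is why checking lengths yourself matters.

```python
import pandas as pd

height = pd.Series([177, 174, 170], name="height")
weight = pd.Series([99.38, 69.31], name="weight")   # one observation short

# Rows are matched by index label; label 2 has no weight, so pandas inserts NaN.
combined = pd.concat([height, weight], axis=1)
print(combined)

# Count the missing values the mismatch produced.
print(combined["weight"].isna().sum())
```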

Now that we have four vectors for our fictitious 20-person study, how would we go about analyzing the data? Well, some statistical analyses can be performed with just the arrays we created before. But for most data analysis applications, we would need to put these arrays together into a table where each vector is a vertical column. We can do this with the `concat()` function in pandas.

In [4]:
df1 = pd.concat([patient_height, patient_weight, weekly_activity, daily_calories], axis=1)
df1

Unnamed: 0,0,1,2,3
0,177,99.38,153,2418
1,174,69.31,541,2830
2,170,58.4,373,2113
3,183,75.42,246,2022
4,168,86.91,312,2255
5,182,66.39,123,2555
6,163,74.57,313,1945
7,191,87.31,295,2381
8,177,73.1,139,2379
9,176,82.97,328,2178


Note that we used the argument `axis=1` at the end of the `concat()` function to merge the vectors horizontally. Without this argument, `concat()` would have stacked all the observations into a single column 80 data points long. Also notice that after the function's open parenthesis, we provided our list of vectors inside square brackets. The square brackets group the vectors, separated by commas, into a single list that is passed as one argument. Without the brackets, each comma inside the round parentheses would mark the start of a new argument, so `concat()` would misread the second vector as a different setting and throw an error.
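The effect of `axis=1` can be checked by comparing shapes. This is a small sketch with two three-value toy Series rather than the chapter's twenty-patient data.

```python
import pandas as pd

a = pd.Series([177, 174, 170])
b = pd.Series([99.38, 69.31, 58.4])

# Default (axis=0): the second Series is stacked underneath the first.
stacked = pd.concat([a, b])

# axis=1: the Series are placed side by side as columns.
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape)        # one long column
print(side_by_side.shape)   # rows by columns
```

With the chapter's four Series of twenty values each, the same logic gives 80 stacked values by default and a 20-row, 4-column table with `axis=1`.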

This table is pretty bare-bones, but it is a dataframe: it has 20 rows and 4 columns (a shape of $20 \times 4$), every column is of equal length, and each observation in every column corresponds to the same unit of observation on that row. The problem is that it has no variable names to make the data more legible. To include variable names, we can do a number of things. Observe how the different steps below all result in the exact same dataframe whether we use:
- The `name=" "` argument within the `pd.Series()` function.
- The `.name = " "` attribute to name or rename a series after it has been created.
- The `pd.DataFrame()` function to create a dataframe from scratch in one command.

In [5]:
# We can assign a name to the pandas Series the moment we create them
patient_height = pd.Series([177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 
                            173, 186, 174, 168, 184, 170, 170, 192, 181, 173], name = "height")
patient_weight = pd.Series([99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 
                            87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 
                            93.53, 97.3, 51.95, 109.97, 89.79, 85.17], name = "weight")
weekly_activity = pd.Series([153, 541, 373, 246, 312, 123, 313, 295, 139, 
                             328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121], name = "weekly activity")
daily_calories = pd.Series([2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 
                            2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 
                            2110, 2421, 2522, 1815], name = "daily calories")

df1_1 = pd.concat([patient_height, patient_weight, weekly_activity, daily_calories], axis=1)

df1_1

Unnamed: 0,height,weight,weekly activity,daily calories
0,177,99.38,153,2418
1,174,69.31,541,2830
2,170,58.4,373,2113
3,183,75.42,246,2022
4,168,86.91,312,2255
5,182,66.39,123,2555
6,163,74.57,313,1945
7,191,87.31,295,2381
8,177,73.1,139,2379
9,176,82.97,328,2178


In [6]:
# We can assign a name to the pandas Series after creating them
patient_height = pd.Series([177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173])
patient_height.name = "height"

patient_weight = pd.Series([99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 93.53, 97.3, 51.95, 109.97, 89.79, 85.17])
patient_weight.name = "weight"

weekly_activity = pd.Series([153, 541, 373, 246, 312, 123, 313, 295, 139, 328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121])
weekly_activity.name = "weekly activity"

daily_calories = pd.Series([2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 2110, 2421, 2522, 1815])
daily_calories.name = "daily calories"

df1_2 = pd.concat([patient_height, patient_weight, weekly_activity, daily_calories], axis=1)

df1_2

Unnamed: 0,height,weight,weekly activity,daily calories
0,177,99.38,153,2418
1,174,69.31,541,2830
2,170,58.4,373,2113
3,183,75.42,246,2022
4,168,86.91,312,2255
5,182,66.39,123,2555
6,163,74.57,313,1945
7,191,87.31,295,2381
8,177,73.1,139,2379
9,176,82.97,328,2178


In [7]:
# Or we can create a dataframe wholesale with pd.DataFrame()
df1_3 = pd.DataFrame(
    {
        "height": [177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173],
        "weight": [99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 93.53, 97.3, 51.95, 109.97, 89.79, 85.17],
        "weekly activity": [153, 541, 373, 246, 312, 123, 313, 295, 139, 328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121],
        "daily calories": [2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 2110, 2421, 2522, 1815]
    }
)

df1_3

Unnamed: 0,height,weight,weekly activity,daily calories
0,177,99.38,153,2418
1,174,69.31,541,2830
2,170,58.4,373,2113
3,183,75.42,246,2022
4,168,86.91,312,2255
5,182,66.39,123,2555
6,163,74.57,313,1945
7,191,87.31,295,2381
8,177,73.1,139,2379
9,176,82.97,328,2178


Whichever method you choose gives the same result; Python and pandas are flexible. 
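You do not have to compare the tables by eye. As a minimal sketch on shortened toy data, the `.equals()` method reports whether two dataframes hold the same values, column names, and data types.

```python
import pandas as pd

# Route 1: name the Series, then concatenate them side by side.
height = pd.Series([177, 174, 170], name="height")
weight = pd.Series([99.38, 69.31, 58.4], name="weight")
from_series = pd.concat([height, weight], axis=1)

# Route 2: build the dataframe in one step from a dictionary of lists.
from_dict = pd.DataFrame(
    {
        "height": [177, 174, 170],
        "weight": [99.38, 69.31, 58.4],
    }
)

# Compare the two results.
print(from_series.equals(from_dict))
```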

## American Community Survey

You are going to work with some real life data that the U.S. Census Bureau collects every year called the [American Community Survey](https://www.census.gov/data/developers/data-sets.html). Using only Python code, you will be able to download the raw data to your computer, load it to the Jupyter Notebook, select relevant variables, subset relevant observations, and make a new variable with basic vector math. ACS data is very wide; each year has multiple datasets for the type of location the census is counting, and each dataset has hundreds of variables to choose from. Additionally, there are datasets for yearly, three-year, and five-year collections. 

The dataset we will be using is the [ACS 1-year data](https://www.census.gov/data/developers/data-sets/acs-1year.html), "Selected Social Characteristics in the United States" for 2022. ACS provides their data tables in [Comma separated values]() format, or CSV. This is essentially a text file where each row/observation is a single line of text, and the values for each variable on that line are separated by commas. These two features are all Python and pandas need to interpret CSV files into a dataframe.

### Loading
We can load this data into our Notebook environment using the pandas function `pd.read_csv()`. The function allows you to specify _where_ the CSV file is located, either locally on your computer with a file path or remotely with a web URL.
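To see what `pd.read_csv()` does without needing a file on disk, the sketch below feeds it a tiny CSV written as a string (wrapped in `io.StringIO` so pandas can read it like a file). The two commented lines show the path and URL forms; both locations are hypothetical placeholders, not real files.

```python
import io
import pandas as pd

# A two-row CSV: the first line holds the column names.
csv_text = "county,households\nBaldwin,98854\nCalhoun,45701\n"

# read_csv accepts a file path, a URL, or a file-like object such as this one.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)

# Equivalent calls for real files (hypothetical locations):
# toy = pd.read_csv("data/my_table.csv")
# toy = pd.read_csv("https://example.com/my_table.csv")
```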

In [8]:
ACS_2022 = pd.read_csv('../../Data/ACS/DP02/County/ACSDP1Y2022.DP02-Data.csv')

Now we have an object in our environment called `ACS_2022`, but we can't really see it unless we use a few different commands that tell us important information about our shiny new dataframe. We'll execute all of these in the cell below. Their outputs represent:
- `type()`: our dataframe is a pandas dataframe.
- `.shape`: the dataframe is 849 rows/observations long and 619 variables/columns wide.
- `.dtypes`: the type of every variable in the dataframe.
- `.info()`: a detailed explanation of the entire dataframe.

The most useful will be `.info()` which tells us all the information of the first three commands in one go. 

In [9]:
type(ACS_2022)
ACS_2022.shape
ACS_2022.dtypes
ACS_2022.info()

pandas.core.frame.DataFrame

(849, 619)

GEO_ID           object
NAME             object
DP02_0001E       object
DP02_0001M       object
DP02_0002E       object
                 ...   
DP02_0153PE      object
DP02_0153PM      object
DP02_0154PE      object
DP02_0154PM      object
Unnamed: 618    float64
Length: 619, dtype: object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 849 entries, 0 to 848
Columns: 619 entries, GEO_ID to Unnamed: 618
dtypes: float64(1), object(618)
memory usage: 4.0+ MB
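One detail worth noticing in these commands is that `.shape` and `.dtypes` are attributes, written without parentheses, while `.info()` is a method and needs them. The sketch below repeats the same inspection on a three-row toy dataframe (made-up values) so that the full output fits on screen.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C"], "households": [10, 20, 30]})

print(type(toy))    # the class of the object
print(toy.shape)    # attribute: (rows, columns)
print(toy.dtypes)   # attribute: one data type per column
toy.info()          # method: prints a combined summary

# shape is a tuple, so it can be unpacked into two numbers.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```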


We can also preview the data with `.head()` or `.tail()` and specify the number of rows we want to look at from the first or last few observations. However, this dataframe is so massive that we get a truncated number of columns, making the head() function less useful. 

In [10]:
ACS_2022.head(10)

Unnamed: 0,GEO_ID,NAME,DP02_0001E,DP02_0001M,DP02_0002E,DP02_0002M,DP02_0003E,DP02_0003M,DP02_0004E,DP02_0004M,...,DP02_0150PM,DP02_0151PE,DP02_0151PM,DP02_0152PE,DP02_0152PM,DP02_0153PE,DP02_0153PM,DP02_0154PE,DP02_0154PM,Unnamed: 618
0,Geography,Geographic Area Name,Estimate!!HOUSEHOLDS BY TYPE!!Total households,Margin of Error!!HOUSEHOLDS BY TYPE!!Total hou...,Estimate!!HOUSEHOLDS BY TYPE!!Total households...,Margin of Error!!HOUSEHOLDS BY TYPE!!Total hou...,Estimate!!HOUSEHOLDS BY TYPE!!Total households...,Margin of Error!!HOUSEHOLDS BY TYPE!!Total hou...,Estimate!!HOUSEHOLDS BY TYPE!!Total households...,Margin of Error!!HOUSEHOLDS BY TYPE!!Total hou...,...,Percent Margin of Error!!ANCESTRY!!Total popul...,Percent!!ANCESTRY!!Total population!!West Indi...,Percent Margin of Error!!ANCESTRY!!Total popul...,Percent!!COMPUTERS AND INTERNET USE!!Total hou...,Percent Margin of Error!!COMPUTERS AND INTERNE...,Percent!!COMPUTERS AND INTERNET USE!!Total hou...,Percent Margin of Error!!COMPUTERS AND INTERNE...,Percent!!COMPUTERS AND INTERNET USE!!Total hou...,Percent Margin of Error!!COMPUTERS AND INTERNE...,
1,0500000US01003,"Baldwin County, Alabama",98854,3781,56885,3828,18113,2367,5562,1687,...,0.2,0.4,0.4,98854,(X),94.6,1.6,89.5,2.2,
2,0500000US01015,"Calhoun County, Alabama",45701,1562,18263,1895,5013,1230,1883,817,...,N,N,N,45701,(X),94.2,1.9,87.8,2.4,
3,0500000US01043,"Cullman County, Alabama",35966,1274,19406,1833,7415,1459,2422,938,...,N,N,N,35966,(X),93.9,2.1,88.6,2.4,
4,0500000US01049,"DeKalb County, Alabama",26459,1114,12586,1415,4195,862,1306,537,...,N,N,N,26459,(X),93.1,2.2,88.7,3.2,
5,0500000US01051,"Elmore County, Alabama",34061,981,18437,1778,7542,1354,1785,795,...,0.9,0.0,0.2,34061,(X),95.6,1.8,93.4,2.1,
6,0500000US01055,"Etowah County, Alabama",39956,1533,20334,1992,5736,1293,2033,869,...,N,N,N,39956,(X),93.0,1.6,85.6,2.7,
7,0500000US01069,"Houston County, Alabama",42417,1234,18272,1195,6005,758,2180,467,...,0.3,0.3,0.3,42417,(X),93.0,1.3,86.2,1.9,
8,0500000US01073,"Jefferson County, Alabama",271877,2948,108510,4382,40212,3193,13595,1924,...,0.1,0.4,0.3,271877,(X),95.7,0.7,90.2,1.2,
9,0500000US01077,"Lauderdale County, Alabama",39021,1359,19509,1751,5861,1259,1909,693,...,0.2,0.1,0.1,39021,(X),91.6,2.3,85.9,3.0,
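If you do want to see every column of a wide dataframe, one option is to lift the display limit temporarily. This is a sketch on a made-up 30-column dataframe; `pd.option_context` changes the `display.max_columns` setting only inside the `with` block, so the rest of the notebook keeps its normal truncated display.

```python
import pandas as pd

# A toy "wide" dataframe: 5 rows and 30 columns named var_0 ... var_29.
wide = pd.DataFrame({f"var_{i}": range(5) for i in range(30)})

print(wide.head(2))   # first two rows; middle columns may be hidden as "..."
print(wide.tail(2))   # last two rows

# Temporarily show all columns.
with pd.option_context("display.max_columns", None):
    print(wide.head(2))
```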


### Selecting Columns
Usually, you're going to work with data that is this big or bigger, but only a few variables will actually be relevant to your research. For this chapter we definitely don't need to look at 619 variables. In the case of the 2022 ACS County Data, the first two variables, `GEO_ID` and `NAME` should remain without question: these represent the actual units of observation. 

ACS provides researchers with a [column metadata](https://raw.githubusercontent.com/ZacharyST/Python_for_Social_Science/main/Data/ACS/DP02/County/ACSDP1Y2022.DP02-Column-Metadata.csv) file for every one of their data files. The column metadata file tells us all of the variable _names_ and their _labels_, or descriptions. For our exercise, we are going to select the following variables:
- DP02_0001E - Total households
- DP02_0018E - Population in households
- DP02_0053PE - Population 3 years and over enrolled in school
- DP02_0054PE - Population 3 years and over enrolled in school!!Nursery school, preschool
- DP02_0055PE - Population 3 years and over enrolled in school!!Kindergarten
- DP02_0056PE - Population 3 years and over enrolled in school!!Elementary school (grades 1-8)
- DP02_0057PE - Population 3 years and over enrolled in school!!High school (grades 9-12)
- DP02_0058PE - Population 3 years and over enrolled in school!!College or graduate school

We select by assigning the subset of variables that we want to take from the `ACS_2022` dataframe to a new object named `education_pop`. No function is needed here; you just need double brackets `[[ ]]`, and you must remember to put the variable names in quotes.

In [11]:
education_pop = ACS_2022[['DP02_0001E', 'DP02_0018E', 'DP02_0053PE', 'DP02_0054PE', 'DP02_0055PE', 'DP02_0056PE', 'DP02_0057PE', 'DP02_0058PE']]
education_pop.head(10)

Unnamed: 0,DP02_0001E,DP02_0018E,DP02_0053PE,DP02_0054PE,DP02_0055PE,DP02_0056PE,DP02_0057PE,DP02_0058PE
0,Estimate!!HOUSEHOLDS BY TYPE!!Total households,Estimate!!RELATIONSHIP!!Population in households,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...
1,98854,242869,50306,7.6,6.0,39.9,27.2,19.3
2,45701,111180,26480,7.3,2.8,39.7,23.8,26.3
3,35966,89458,19677,11.4,5.4,44.0,21.0,18.2
4,26459,71128,16468,4.2,3.8,50.7,23.2,18.1
5,34061,85530,19722,6.5,4.5,46.9,24.6,17.5
6,39956,101331,21742,6.4,9.6,43.6,20.5,20.0
7,42417,107231,24188,7.0,5.9,44.5,21.9,20.7
8,271877,645138,166550,8.4,4.8,38.8,20.9,27.1
9,39021,92661,22138,3.8,3.1,43.9,18.1,31.2
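The double brackets are less mysterious than they look: the inner pair is an ordinary Python list of column names, and the outer pair is the selection itself. The sketch below, on a toy dataframe with made-up values, shows the difference between selecting with a single name and selecting with a list.

```python
import pandas as pd

toy = pd.DataFrame(
    {"NAME": ["A", "B"], "PRE_K": [7.6, 7.3], "KINDER": [6.0, 2.8]}
)

one_series = toy["PRE_K"]           # single brackets: one column as a Series
one_frame = toy[["PRE_K"]]          # double brackets: a one-column dataframe
two_cols = toy[["NAME", "KINDER"]]  # double brackets: several columns

# The inner list can also be stored in a variable first.
wanted = ["NAME", "KINDER"]
same_two_cols = toy[wanted]

print(type(one_series).__name__, type(one_frame).__name__)
print(list(two_cols.columns))
```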


We made a mistake! The `education_pop` dataframe is missing the two variables that identify the geographical place. Use the empty code cell below to assign `education_pop` with all the right variables.

In [12]:
# Reassign education_pop here

# Delete this from exercise
education_pop = ACS_2022[['GEO_ID','NAME','DP02_0001E', 'DP02_0018E', 'DP02_0053PE', 'DP02_0054PE', 'DP02_0055PE', 'DP02_0056PE', 'DP02_0057PE', 'DP02_0058PE']]
education_pop.head(10)

Unnamed: 0,GEO_ID,NAME,DP02_0001E,DP02_0018E,DP02_0053PE,DP02_0054PE,DP02_0055PE,DP02_0056PE,DP02_0057PE,DP02_0058PE
0,Geography,Geographic Area Name,Estimate!!HOUSEHOLDS BY TYPE!!Total households,Estimate!!RELATIONSHIP!!Population in households,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...,Percent!!SCHOOL ENROLLMENT!!Population 3 years...
1,0500000US01003,"Baldwin County, Alabama",98854,242869,50306,7.6,6.0,39.9,27.2,19.3
2,0500000US01015,"Calhoun County, Alabama",45701,111180,26480,7.3,2.8,39.7,23.8,26.3
3,0500000US01043,"Cullman County, Alabama",35966,89458,19677,11.4,5.4,44.0,21.0,18.2
4,0500000US01049,"DeKalb County, Alabama",26459,71128,16468,4.2,3.8,50.7,23.2,18.1
5,0500000US01051,"Elmore County, Alabama",34061,85530,19722,6.5,4.5,46.9,24.6,17.5
6,0500000US01055,"Etowah County, Alabama",39956,101331,21742,6.4,9.6,43.6,20.5,20.0
7,0500000US01069,"Houston County, Alabama",42417,107231,24188,7.0,5.9,44.5,21.9,20.7
8,0500000US01073,"Jefferson County, Alabama",271877,645138,166550,8.4,4.8,38.8,20.9,27.1
9,0500000US01077,"Lauderdale County, Alabama",39021,92661,22138,3.8,3.1,43.9,18.1,31.2


Now inspect your dataframe.

In [13]:
education_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 849 entries, 0 to 848
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   GEO_ID       849 non-null    object
 1   NAME         849 non-null    object
 2   DP02_0001E   838 non-null    object
 3   DP02_0018E   838 non-null    object
 4   DP02_0053PE  838 non-null    object
 5   DP02_0054PE  838 non-null    object
 6   DP02_0055PE  838 non-null    object
 7   DP02_0056PE  838 non-null    object
 8   DP02_0057PE  838 non-null    object
 9   DP02_0058PE  838 non-null    object
dtypes: object(10)
memory usage: 66.5+ KB


Something is still wrong with the data after inspection: The data type under the `Dtype` column says 'object'. In pandas, the object data type means that the column in question has text data or a mix of numbers and text. Why is this? Look back at the output of the last `education_pop.head()` command. What does the first observation on row 0 look like? This is the data labels row and it is keeping Python from interpreting our numeric values as numbers. This kind of data mismatch happens _very_ often, so let's clean this dataframe up by doing a few things.
1. Use the pandas method `.drop()` to remove the first row from the dataframe, because it holds text labels and we want numeric values.
2. Check the data types again.
3. Convert the columns to a numeric data type.
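As an aside, the same problem can sometimes be avoided at load time. The sketch below is an alternative to the steps above, not the chapter's method: it rebuilds a two-county miniature of the ACS file as a string (the column name `HOUSEHOLDS` is a simplification for illustration) and uses the `skiprows` argument of `pd.read_csv()` to skip the label line, so pandas can infer numeric types on its own.

```python
import io
import pandas as pd

# Miniature CSV: header line, then a label line, then two data lines.
raw = (
    "GEO_ID,NAME,HOUSEHOLDS\n"
    "Geography,Geographic Area Name,Estimate!!Total households\n"
    '0500000US01003,"Baldwin County, Alabama",98854\n'
    '0500000US01015,"Calhoun County, Alabama",45701\n'
)

# Loaded as-is, the label text forces HOUSEHOLDS to be stored as text.
with_labels = pd.read_csv(io.StringIO(raw))
print(with_labels.dtypes)

# skiprows=[1] skips the second line of the file (line numbers start at 0).
without_labels = pd.read_csv(io.StringIO(raw), skiprows=[1])
print(without_labels.dtypes)
```

This only resolves the label row. Columns that contain other text codes would still load as text and need the conversion step described in this chapter.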

### Drop rows with .drop()

In [14]:
# First row's index is "0". axis=0 means drop a row; axis=1 means drop a column.
education_pop = education_pop.drop(0, axis=0)

# Check the first ten rows: The label row is gone!
education_pop.head(10)

Unnamed: 0,GEO_ID,NAME,DP02_0001E,DP02_0018E,DP02_0053PE,DP02_0054PE,DP02_0055PE,DP02_0056PE,DP02_0057PE,DP02_0058PE
1,0500000US01003,"Baldwin County, Alabama",98854,242869,50306,7.6,6.0,39.9,27.2,19.3
2,0500000US01015,"Calhoun County, Alabama",45701,111180,26480,7.3,2.8,39.7,23.8,26.3
3,0500000US01043,"Cullman County, Alabama",35966,89458,19677,11.4,5.4,44.0,21.0,18.2
4,0500000US01049,"DeKalb County, Alabama",26459,71128,16468,4.2,3.8,50.7,23.2,18.1
5,0500000US01051,"Elmore County, Alabama",34061,85530,19722,6.5,4.5,46.9,24.6,17.5
6,0500000US01055,"Etowah County, Alabama",39956,101331,21742,6.4,9.6,43.6,20.5,20.0
7,0500000US01069,"Houston County, Alabama",42417,107231,24188,7.0,5.9,44.5,21.9,20.7
8,0500000US01073,"Jefferson County, Alabama",271877,645138,166550,8.4,4.8,38.8,20.9,27.1
9,0500000US01077,"Lauderdale County, Alabama",39021,92661,22138,3.8,3.1,43.9,18.1,31.2
10,0500000US01081,"Lee County, Alabama",71830,173539,60932,4.4,3.4,28.7,13.3,50.2
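Notice that the row numbers on the left now start at 1: `.drop()` removes the row but does not renumber the remaining ones. That is harmless here, but if you prefer an index that starts at 0 again, `.reset_index(drop=True)` is one way to get it. A small sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "NAME": ["Geographic Area Name", "Baldwin County, Alabama", "Calhoun County, Alabama"],
        "HOUSEHOLDS": ["Estimate", "98854", "45701"],
    }
)

# Drop the row whose index label is 0; the other labels are unchanged.
dropped = toy.drop(0, axis=0)
print(list(dropped.index))

# Optionally renumber from 0; drop=True discards the old labels.
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))
```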


Since we dropped all the variable labels, the dataframe has become somewhat hard to read without a data dictionary nearby. There aren't too many variables, so we can change the names to ones of our choosing with `.rename()`. For context, we'll provide the dictionary summary again:

- DP02_0001E - Total households
- DP02_0018E - Population in households
- DP02_0053PE - Population 3 years and over enrolled in school
- DP02_0054PE - Population 3 years and over enrolled in school!!Nursery school, preschool
- DP02_0055PE - Population 3 years and over enrolled in school!!Kindergarten
- DP02_0056PE - Population 3 years and over enrolled in school!!Elementary school (grades 1-8)
- DP02_0057PE - Population 3 years and over enrolled in school!!High school (grades 9-12)
- DP02_0058PE - Population 3 years and over enrolled in school!!College or graduate school


### Rename pandas columns
New variable names should not be longer than two words, and one word would be best if it's descriptive enough. If a name has two words, we always use an underscore (`one_two`) between them, never an empty space or a period. The first two variables, `GEO_ID` and `NAME`, are descriptive enough, so we can exclude them from this renaming function and they will stay the same.

In [15]:
education_pop.rename(columns={'DP02_0001E': 'TOT_HOUSEHOLD', 
                              'DP02_0018E':'POPULATION', 
                              'DP02_0053PE':'TOT_ENROLLED', 
                              'DP02_0054PE':'PRE_K', 
                              'DP02_0055PE':'KINDER', 
                              'DP02_0056PE':'ELEMENTARY', 
                              'DP02_0057PE':'HIGH_SCHOOL', 
                              'DP02_0058PE':'COLLEGE'}, inplace=True) # Note: inplace=True modifies education_pop directly,
                                                                      # so we do not need to reassign it with an '=' sign

education_pop.head(10)


Unnamed: 0,GEO_ID,NAME,TOT_HOUSEHOLD,POPULATION,TOT_ENROLLED,PRE_K,KINDER,ELEMENTARY,HIGH_SCHOOL,COLLEGE
1,0500000US01003,"Baldwin County, Alabama",98854,242869,50306,7.6,6.0,39.9,27.2,19.3
2,0500000US01015,"Calhoun County, Alabama",45701,111180,26480,7.3,2.8,39.7,23.8,26.3
3,0500000US01043,"Cullman County, Alabama",35966,89458,19677,11.4,5.4,44.0,21.0,18.2
4,0500000US01049,"DeKalb County, Alabama",26459,71128,16468,4.2,3.8,50.7,23.2,18.1
5,0500000US01051,"Elmore County, Alabama",34061,85530,19722,6.5,4.5,46.9,24.6,17.5
6,0500000US01055,"Etowah County, Alabama",39956,101331,21742,6.4,9.6,43.6,20.5,20.0
7,0500000US01069,"Houston County, Alabama",42417,107231,24188,7.0,5.9,44.5,21.9,20.7
8,0500000US01073,"Jefferson County, Alabama",271877,645138,166550,8.4,4.8,38.8,20.9,27.1
9,0500000US01077,"Lauderdale County, Alabama",39021,92661,22138,3.8,3.1,43.9,18.1,31.2
10,0500000US01081,"Lee County, Alabama",71830,173539,60932,4.4,3.4,28.7,13.3,50.2
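The same renaming can be written without `inplace=True` by assigning the result to a name. In this sketch (a one-row toy dataframe with a placeholder `GEO_ID` value), `.rename()` returns a new dataframe and leaves the original untouched, which can be handy when you want to keep both versions.

```python
import pandas as pd

toy = pd.DataFrame(
    {"GEO_ID": ["id_1"], "DP02_0001E": [98854], "DP02_0018E": [242869]}
)

# Without inplace=True, rename returns a renamed copy.
renamed = toy.rename(
    columns={"DP02_0001E": "TOT_HOUSEHOLD", "DP02_0018E": "POPULATION"}
)

print(list(toy.columns))       # original names are unchanged
print(list(renamed.columns))   # columns not listed in the mapping keep their names
```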


The dataframe should now be much simpler for you to read and understand. Now back to our original problem: numeric variables were being interpreted as `object` data types. You have already dropped the row that contained text data in all of our variables. Use `.info()` again to check whether this is still the case.

In [16]:
# Check the data types now
education_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 848 entries, 1 to 848
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   GEO_ID         848 non-null    object
 1   NAME           848 non-null    object
 2   TOT_HOUSEHOLD  837 non-null    object
 3   POPULATION     837 non-null    object
 4   TOT_ENROLLED   837 non-null    object
 5   PRE_K          837 non-null    object
 6   KINDER         837 non-null    object
 7   ELEMENTARY     837 non-null    object
 8   HIGH_SCHOOL    837 non-null    object
 9   COLLEGE        837 non-null    object
dtypes: object(10)
memory usage: 66.4+ KB


### Pandas datatypes: to_numeric()
It is as we feared: all of our numeric variables are still categorized as the object type. The conversion is not automatic even though we removed text values. We will need to re-assign them to the correct data type, which in pandas dataframes can be `int` or `float`. Either will work, so let's get this sorted out using `apply` with pandas' `pd.to_numeric()` function. 

In [17]:
education_pop[['TOT_HOUSEHOLD',
               'POPULATION', 
               'TOT_ENROLLED',
               'PRE_K','KINDER',
               'ELEMENTARY',
               'HIGH_SCHOOL',
               'COLLEGE']] = education_pop[['TOT_HOUSEHOLD',
                                            'POPULATION',
                                            'TOT_ENROLLED',
                                            'PRE_K','KINDER',
                                            'ELEMENTARY',
                                            'HIGH_SCHOOL',
                                            'COLLEGE']].apply(pd.to_numeric) 
education_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 848 entries, 1 to 848
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   GEO_ID         848 non-null    object 
 1   NAME           848 non-null    object 
 2   TOT_HOUSEHOLD  837 non-null    float64
 3   POPULATION     837 non-null    float64
 4   TOT_ENROLLED   837 non-null    float64
 5   PRE_K          837 non-null    float64
 6   KINDER         837 non-null    float64
 7   ELEMENTARY     837 non-null    float64
 8   HIGH_SCHOOL    837 non-null    float64
 9   COLLEGE        837 non-null    float64
dtypes: float64(8), object(2)
memory usage: 66.4+ KB


Success! Python now understands that our numeric variables are indeed float (numeric) data types. Some comments about the code above: We had to specify each variable that needed transforming on the left side of the $=$ sign as the columns to receive new values from the right side of the $=$ sign. Then, we needed to repeat this specification of the columns to transform into numeric data types with `.apply(pd.to_numeric)`. The spacing might look a little odd right now, but you will become used to it! Lines of code can be incredibly long, and pressing the enter key to break the length of the line is a crucial part of making your code legible.
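The conversion worked here because the selected columns held only numbers and missing values. Other ACS columns contain annotation codes such as `(X)` and `N`, which are visible in the earlier `.head()` output, and by default `pd.to_numeric()` raises an error when it meets text it cannot convert. One possible way to handle that, sketched below on a made-up Series, is the `errors="coerce"` option, which turns unconvertible entries into missing values.

```python
import pandas as pd

# Text values as they might appear in a raw column, including two codes.
raw = pd.Series(["94.6", "(X)", "N", "89.5"])

# errors="coerce" replaces anything that is not a number with NaN.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(clean.isna().sum())   # how many entries could not be converted
```

Coercing silently discards information, so it is worth counting the resulting missing values and checking what the codes mean in the data documentation before relying on it.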

Now you might be curious about the data types of our first two variables, `GEO_ID` and `NAME`, and whether we ought to transform these into _string_ types. The short answer is no. We should only really worry if a column that we know is numeric is not categorized as such. This is because most, if not all, mathematical functions require numeric columns as inputs. Text-based operations are generally less strict about the data type of the whole column.

### New Variable Creation: Split
However, one of our dataframe's variables has a name that does not quite match its contents. The column `NAME` actually contains two values in each cell, the county name and the state name, separated by a comma. It would be more helpful for us to have these two values as separate columns. You can use `str.split()` to make two new columns, then `.insert()` to place them after the original `NAME` column.

The `str.split()` function takes the values in a cell and separates them by whatever symbol you tell it is the delimiter. In our case the county and state are separated by a comma. The function spits out two series of values, one on either side of a comma: The first value is indexed as `0`, and the second value is indexed as position `1`, etc., just like the vectors we have been creating in this chapter.

In [18]:
state = education_pop['NAME'].str.split(',').str[1] # The "1" in brackets represents the second string after the comma
county = education_pop['NAME'].str.split(',').str[0] # The "0" represents the first string before the comma
state.head(3)
county.head(3)

education_pop.insert(2, 'COUNTY', county)
education_pop.insert(3, 'STATE', state)
education_pop.head(5)

1     Alabama
2     Alabama
3     Alabama
Name: NAME, dtype: object

1    Baldwin County
2    Calhoun County
3    Cullman County
Name: NAME, dtype: object

Unnamed: 0,GEO_ID,NAME,COUNTY,STATE,TOT_HOUSEHOLD,POPULATION,TOT_ENROLLED,PRE_K,KINDER,ELEMENTARY,HIGH_SCHOOL,COLLEGE
1,0500000US01003,"Baldwin County, Alabama",Baldwin County,Alabama,98854.0,242869.0,50306.0,7.6,6.0,39.9,27.2,19.3
2,0500000US01015,"Calhoun County, Alabama",Calhoun County,Alabama,45701.0,111180.0,26480.0,7.3,2.8,39.7,23.8,26.3
3,0500000US01043,"Cullman County, Alabama",Cullman County,Alabama,35966.0,89458.0,19677.0,11.4,5.4,44.0,21.0,18.2
4,0500000US01049,"DeKalb County, Alabama",DeKalb County,Alabama,26459.0,71128.0,16468.0,4.2,3.8,50.7,23.2,18.1
5,0500000US01051,"Elmore County, Alabama",Elmore County,Alabama,34061.0,85530.0,19722.0,6.5,4.5,46.9,24.6,17.5
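One subtlety of splitting on a bare comma is that the space after the comma stays attached to the second piece, so the state comes out as `' Alabama'` with a leading space. A possible refinement, sketched here on a two-row toy Series, is to add `.str.strip()`; the `expand=True` option is also shown as a way to get both pieces from a single split.

```python
import pandas as pd

names = pd.Series(
    ["Baldwin County, Alabama", "Calhoun County, Alabama"], name="NAME"
)

# Splitting on "," alone keeps the space that followed the comma.
state_raw = names.str.split(",").str[1]
print(repr(state_raw.iloc[0]))

# expand=True returns a dataframe with one column per piece (named 0 and 1).
parts = names.str.split(",", expand=True)
county = parts[0].str.strip()
state = parts[1].str.strip()   # strip removes leading and trailing spaces
print(repr(state.iloc[0]))
```

Removing the stray space matters later on, because a filter such as `STATE == "Alabama"` would not match `" Alabama"`.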


This will make it easier for us to sort and subset this data by state, or inspect specific counties within states. We are left with the original `NAME` variable in the second column, which is fine. However, if you want to practice removing a column from your dataframe, there is a code cell below for you to give it a try.

Hint 1: We already dropped one _row_ in an earlier cell using `.drop()`. That line of code should be enough to show you how to drop a column.

Hint 2: You can identify the column by its name _or_ by its position among the columns. Just be mindful which one you are using, because `.drop()` expects labels! `education_pop.info()` will print each column's position next to its name.

### New Variable Creation: Vector Math
Similar to the string split variables we created above, we can use basic math operators in Python to calculate new values based on the numbers in our data. Take the final five columns in our data. They represent the percentage of enrolled students at different levels of education, from pre-k to college and graduate school. We might be more interested in the actual number of people enrolled in college, not the percentage. We can create that new column with some simple math and (relatively) simple code! Let's look at the problem a little more abstractly. We have the total number of people enrolled in some form of educational institution in the variable `TOT_ENROLLED`, and percentages of that total number in the final five columns of the dataframe. We can calculate the actual number of people using the percentage as a decimal. That is, if the percentage is $19\%$, you multiply the total enrolled population by $0.19$ to get the right number of people.
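
As a worked example, take the first row of the table above, Baldwin County, Alabama, where `TOT_ENROLLED` is 50306 and `COLLEGE` is 19.3:

$$
50306 \times \frac{19.3}{100} = 50306 \times 0.193 = 9709.058
$$

So roughly 9,709 of the county's enrolled students are in college or graduate school.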

So we have to think of those data points we are operating upon as cells along a vector. Pandas lets you do that quite easily: You provide the names of the columns and the math operators you need, including using parentheses for specific order-of-operations. Let's see one way to do it in the following steps:

1. First check whether the percentages add up to 100, just to be safe. We do this by selecting the percentage columns by name, then appending `.sum(axis=1)` to add those values across the columns of each row.
2. Check your math by multiplying and dividing variables. In our case we want the enrolled population, `TOT_ENROLLED`, multiplied by the college percentage, `COLLEGE`, itself divided by 100 to move the decimal point two places to the left.
3. Once that math checks out, we can assign that operation to a new variable for the college population, `COLLEGE_POP`.

In [19]:
education_pop[['PRE_K',
               'KINDER',
               'ELEMENTARY',
               'HIGH_SCHOOL',
               'COLLEGE']].sum(axis=1) # Sum only the five percentage columns; rows with data should be close to 100.

education_pop['TOT_ENROLLED'] * (education_pop['COLLEGE']/100) # Gives us a vector equal to the actual number of enrolled college students

education_pop['COLLEGE_POP'] = education_pop['TOT_ENROLLED'] * (education_pop['COLLEGE']/100) 
# You can create a new variable on the left-side of the equals sign and assign its value on the right side.

education_pop.head(5)

1      100.0
2       99.9
3      100.0
4      100.0
5      100.0
       ...  
844      0.0
845      0.0
846      0.0
847      0.0
848      0.0
Length: 848, dtype: float64

1      9709.058
2      6964.240
3      3581.214
4      2980.708
5      3451.350
         ...   
844         NaN
845         NaN
846         NaN
847         NaN
848         NaN
Length: 848, dtype: float64

Unnamed: 0,GEO_ID,NAME,COUNTY,STATE,TOT_HOUSEHOLD,POPULATION,TOT_ENROLLED,PRE_K,KINDER,ELEMENTARY,HIGH_SCHOOL,COLLEGE,COLLEGE_POP
1,0500000US01003,"Baldwin County, Alabama",Baldwin County,Alabama,98854.0,242869.0,50306.0,7.6,6.0,39.9,27.2,19.3,9709.058
2,0500000US01015,"Calhoun County, Alabama",Calhoun County,Alabama,45701.0,111180.0,26480.0,7.3,2.8,39.7,23.8,26.3,6964.24
3,0500000US01043,"Cullman County, Alabama",Cullman County,Alabama,35966.0,89458.0,19677.0,11.4,5.4,44.0,21.0,18.2,3581.214
4,0500000US01049,"DeKalb County, Alabama",DeKalb County,Alabama,26459.0,71128.0,16468.0,4.2,3.8,50.7,23.2,18.1,2980.708
5,0500000US01051,"Elmore County, Alabama",Elmore County,Alabama,34061.0,85530.0,19722.0,6.5,4.5,46.9,24.6,17.5,3451.35
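
The percentage check in step 1 can also be written as an explicit test rather than something you inspect by eye. The sketch below uses a small made-up dataframe, `pct`, shaped like the five percentage columns. The tolerance of half a percentage point is an assumption to allow for rounding in published percentages, not a rule from the ACS.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the percentage columns; not the real ACS data
pct = pd.DataFrame({
    "PRE_K": [7.6, 7.3, np.nan],
    "KINDER": [6.0, 2.8, np.nan],
    "ELEMENTARY": [39.9, 39.7, np.nan],
    "HIGH_SCHOOL": [27.2, 23.8, np.nan],
    "COLLEGE": [19.3, 26.3, np.nan],
})

# .sum() skips missing values, so a row that is entirely NaN sums to 0.0
row_totals = pct.sum(axis=1)

# True where the row total is within 0.5 of 100 (tolerance chosen for illustration)
close_to_100 = np.isclose(row_totals, 100, atol=0.5)

print(row_totals.round(1).tolist())
print(close_to_100.tolist())
```

The first two rows pass the check and the empty row does not, which is a quick way to spot observations with missing percentages.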


### Subset
When inspecting your data to gain some sort of insight, one of the most common tools you will use is subsetting. Subsets of data are sections of the whole dataset that meet some specific criteria that are of interest to you. We just subset an ACS 2022 dataset into a smaller dataset with a few relevant variables, for example. However, there are still hundreds of observations that may not all be relevant or useful to your interests. A researcher might only need data for a specific state or region. If that is the case, we can subset with conditional statements using `str.contains()`, which keeps only the rows whose text matches. This is a filtering process.

In [20]:
oregon = education_pop[education_pop['STATE'].str.contains('Oregon')] 
oregon.head(5)
oregon.shape # How many rows and columns

Unnamed: 0,GEO_ID,NAME,COUNTY,STATE,TOT_HOUSEHOLD,POPULATION,TOT_ENROLLED,PRE_K,KINDER,ELEMENTARY,HIGH_SCHOOL,COLLEGE,COLLEGE_POP
588,0500000US41003,"Benton County, Oregon",Benton County,Oregon,38831.0,90998.0,33497.0,3.6,2.0,22.0,9.7,62.7,21002.619
589,0500000US41005,"Clackamas County, Oregon",Clackamas County,Oregon,163805.0,418638.0,86983.0,5.9,6.3,44.4,25.0,18.3,15917.889
590,0500000US41017,"Deschutes County, Oregon",Deschutes County,Oregon,85108.0,205263.0,35176.0,7.3,4.7,46.6,26.2,15.2,5346.752
591,0500000US41019,"Douglas County, Oregon",Douglas County,Oregon,47263.0,110078.0,20650.0,7.1,4.0,45.1,27.3,16.5,3407.25
592,0500000US41029,"Jackson County, Oregon",Jackson County,Oregon,92554.0,217593.0,45151.0,3.5,3.6,46.6,24.7,21.6,9752.616


(15, 13)
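
It can help to look at what sits inside the square brackets on its own. The sketch below uses a made-up three-row dataframe, `toy`, to show that the condition is a vector of `True` and `False` values, and that indexing with it keeps only the `True` rows.

```python
import pandas as pd

# Hypothetical three-row frame; the values are made up for illustration
toy = pd.DataFrame({
    "STATE": [" Oregon", " Alabama", " Oregon"],
    "POPULATION": [90998, 242869, 418638],
})

mask = toy["STATE"].str.contains("Oregon")  # boolean Series, one value per row
oregon_toy = toy[mask]                      # keeps only the rows where mask is True

print(mask.tolist())
print(oregon_toy.shape)
```

Because `str.contains()` matches substrings, a search for `'Virginia'` would also pick up `West Virginia`. When you need an exact match, comparing with `==` on a whitespace-stripped column is a stricter alternative.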

We can also subset by quantity using numeric vectors. With a very similar syntax we can filter for counties that have more than a million residents. The main difference is that we do not need to call any additional functions like `str.contains()`. A simple comparison operator `>` on the relevant column is enough.

In [25]:
large_counties = education_pop[education_pop['POPULATION']>1000000]
large_counties.head(5)
large_counties.shape # How many rows and columns

Unnamed: 0,GEO_ID,NAME,COUNTY,STATE,TOT_HOUSEHOLD,POPULATION,TOT_ENROLLED,PRE_K,KINDER,ELEMENTARY,HIGH_SCHOOL,COLLEGE,COLLEGE_POP
27,0500000US04013,"Maricopa County, Arizona",Maricopa County,Arizona,1726554.0,4479734.0,1087431.0,4.3,4.8,41.7,23.5,25.6,278382.336
30,0500000US04019,"Pima County, Arizona",Pima County,Arizona,436469.0,1029702.0,255016.0,3.4,4.0,36.8,20.7,35.1,89510.616
45,0500000US06001,"Alameda County, California",Alameda County,California,596614.0,1591744.0,385159.0,5.3,5.4,35.7,19.5,34.1,131339.219
47,0500000US06013,"Contra Costa County, California",Contra Costa County,California,415194.0,1144370.0,283487.0,5.9,4.8,39.1,23.9,26.3,74557.081
55,0500000US06037,"Los Angeles County, California",Los Angeles County,California,3415726.0,9533172.0,2340018.0,5.1,4.6,37.8,21.6,31.0,725405.58


(45, 13)

More complicated subsetting based on multiple variables is also possible. If you wanted to obtain observations for counties in California __and__ with over a million residents, you can combine both filters inside the square brackets. Each condition has to be wrapped in parentheses `()`, and you place a logical operator between the conditions: `&` for 'AND' or `|` for 'OR'.

In [31]:
large_counties_ca = education_pop[(education_pop['STATE'].str.contains('California')) & (education_pop['POPULATION']>1000000)]
large_counties_ca


Unnamed: 0,GEO_ID,NAME,COUNTY,STATE,TOT_HOUSEHOLD,POPULATION,TOT_ENROLLED,PRE_K,KINDER,ELEMENTARY,HIGH_SCHOOL,COLLEGE,COLLEGE_POP
45,0500000US06001,"Alameda County, California",Alameda County,California,596614.0,1591744.0,385159.0,5.3,5.4,35.7,19.5,34.1,131339.219
47,0500000US06013,"Contra Costa County, California",Contra Costa County,California,415194.0,1144370.0,283487.0,5.9,4.8,39.1,23.9,26.3,74557.081
55,0500000US06037,"Los Angeles County, California",Los Angeles County,California,3415726.0,9533172.0,2340018.0,5.1,4.6,37.8,21.6,31.0,725405.58
63,0500000US06059,"Orange County, California",Orange County,California,1085225.0,3094994.0,812421.0,5.9,4.9,35.6,19.6,34.0,276223.14
65,0500000US06065,"Riverside County, California",Riverside County,California,769475.0,2439260.0,652607.0,3.9,4.5,41.6,22.8,27.3,178161.711
66,0500000US06067,"Sacramento County, California",Sacramento County,California,572744.0,1558455.0,394311.0,4.8,4.2,41.0,21.3,28.8,113561.568
68,0500000US06071,"San Bernardino County, California",San Bernardino County,California,674191.0,2155780.0,604894.0,4.3,4.5,41.9,22.8,26.5,160296.91
69,0500000US06073,"San Diego County, California",San Diego County,California,1172343.0,3160203.0,811451.0,5.5,4.5,37.4,19.7,32.9,266967.379
75,0500000US06085,"Santa Clara County, California",Santa Clara County,California,656477.0,1829890.0,468451.0,6.2,3.8,36.0,21.2,32.8,153651.928
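
The logical operators are easier to compare side by side on a small example. The sketch below uses a made-up four-row dataframe, `toy`, and stores each condition in its own variable first, which is an optional style choice that avoids long one-line filters. It also shows `~`, which negates a condition.

```python
import pandas as pd

# Hypothetical four-row frame; the values are made up for illustration
toy = pd.DataFrame({
    "STATE": ["California", "California", "Oregon", "Arizona"],
    "POPULATION": [1591744, 90000, 418638, 4479734],
})

in_ca = toy["STATE"] == "California"       # one boolean vector per condition
is_large = toy["POPULATION"] > 1_000_000

both = toy[in_ca & is_large]     # AND: California counties over a million
either = toy[in_ca | is_large]   # OR: in California, or over a million, or both
not_ca = toy[~in_ca]             # NOT: everything outside California

print(len(both), len(either), len(not_ca))
```

The parentheses matter when conditions are written inline because `&` and `|` bind more tightly than comparisons such as `>` in Python, so leaving them out changes what is compared with what.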


### Sort
Finally, we can use a sorting function to order our observations from largest to smallest or vice versa. This is particularly useful when we are interested in the greatest or least observations for a given variable. For example, what if we were curious about the counties where college students make up the largest share of the enrolled population? We sort by descending values along the `COLLEGE` column using `.sort_values()`, and see the top ten counties with the `.head()` function. We include the argument `ascending=False` to make the data go from largest to smallest value. The default value is `True`, which sorts from smallest to largest, so pay attention!

In [None]:
education_pop.sort_values(by='COLLEGE', ascending=False).head(10)

Note that in this example we chose not to assign the output to a new dataframe. This is a reasonable choice when you simply want to glance at the result of sorting, subsetting, or filtering. For example, you might only want to confirm that your code really works, in which case it is fine to run the code without assigning a new object.
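
Here is a minimal sketch of the same idea on a made-up dataframe, `toy`. It shows that `.sort_values()` returns a new, sorted dataframe and leaves the original order untouched, and that `.nlargest()` is a shorthand for sorting in descending order and taking the top rows.

```python
import pandas as pd

# Hypothetical four-row frame; the values are made up for illustration
toy = pd.DataFrame({
    "COUNTY": ["A", "B", "C", "D"],
    "COLLEGE": [19.3, 62.7, 26.3, 15.2],
})

descending = toy.sort_values(by="COLLEGE", ascending=False)  # largest first
top_two = descending.head(2)                                 # first two rows after sorting
same_top_two = toy.nlargest(2, "COLLEGE")                    # shorthand for the two lines above

print(descending["COUNTY"].tolist())
print(toy["COUNTY"].tolist())  # the original dataframe keeps its order
```

If you do want to keep the sorted order, assign the result to a name, as `descending` does here.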

### Saving your Data
Now that we are done with our first glimpses into the ACS DP02 data set, it's time to save our progress. You should have several pandas dataframes stored in your Python environment. You can check which dataframes are loaded in the environment with the `%whos` command below.

In [None]:
%whos DataFrame

There you have all of the pandas dataframes we have assigned in this notebook. We want you to save three of the clean and subset dataframes on your computer: `education_pop`, `oregon`, and `large_counties`. We'll save the larger dataframe, `education_pop`, as a comma-separated values (`.csv`) file, and the other two in the ubiquitous Microsoft Excel `.xlsx` format. The save function depends on the file format.

In [None]:
import os
print(os.getcwd()) # Get our current working directory

You should be in your `Python for Social Sciences/lessons/pandas_I/` folder. This is a fine place to save your chapter 2 datasets using pandas' `to_csv()` and `to_excel()` functions.  

In [None]:
education_pop.to_csv('education_pop.csv', index=False) # index=True would write the row labels as an extra first column, so we set it to False here

oregon.to_excel('oregon.xlsx', index=False)

large_counties.to_excel('large_counties.xlsx', index=False)

os.listdir() # to see if your objects are saved in the current directory.
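
To see what the `index` argument changes, the sketch below writes a made-up two-row dataframe to a temporary folder both ways and reads each file back. The file names and the `toy` dataframe are hypothetical, and the temporary folder is deleted automatically so nothing is left in your working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame; the values are made up for illustration
toy = pd.DataFrame({"COUNTY": ["A", "B"], "POPULATION": [100, 200]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # default index=True writes the row labels too
    toy.to_csv(without_index, index=False)  # only the data columns are written

    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

The file saved with the index comes back with an extra column that pandas labels `Unnamed: 0`, which is the same label that appears in the table outputs in this chapter. Note also that writing `.xlsx` files with `to_excel()` relies on an Excel engine such as `openpyxl` being installed alongside pandas.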

## Bonus Script: Random Number Generation
For our imaginary sample of patient data at the beginning of the chapter, we created all of our vectors using a random number generator from the `numpy` package. The function returns random numbers based on some parameters we provide. We wanted to create height, weight, calories, and minutes of activity, all of which tend to look like a normal distribution (also known as a bell curve). So we use the `np.random.normal()` function and give it
- an average value with `loc=`
- the standard deviation around the mean with `scale=`
- and the number of random numbers to produce with `size=`.

For height, calories, and minutes of daily activity, we wanted the numbers to be whole, without decimal points, so we added `.astype(int)` to our random number generator line, which cuts off the decimals rather than rounding. We did want decimal points for weight, on the other hand, so we wrapped the entire random function inside of `np.round(..., decimals=y)` to specify the number of decimal points. Finally, we did a very superficial web search with terms like "daily physical activity mean and standard deviation" to make somewhat realistic averages and standard deviations.

In statistics, and especially Bayesian statistics, creating data this way is a routine part of running simulations. For methods instructors like us, it is a handy way to make realistic examples to teach.

In [None]:
# Our random number array generator

import numpy as np # load the numpy package 

np.random.normal(loc=177, scale=7.59, size=20).astype(int)  # Height. loc = mean, scale = standard deviation. astype(int) cuts off the decimals.

np.round(np.random.normal(loc=76, scale=13, size=20),decimals=2) # With 2 decimal points for weight

np.random.normal(loc=2142, scale=267, size=20).astype(int) # Calories per day

np.random.normal(loc=400, scale=350, size=20).astype(int) # Minutes of activity
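
Two practical caveats apply to the cell above: it produces different numbers every time it runs, and with a mean of 400 and a standard deviation of 350 some draws for minutes of activity can come out negative. The sketch below is one possible variation, not the method used to build the chapter's data. It uses NumPy's `default_rng()` generator with a seed so the numbers are reproducible, and `np.clip()` to replace negative minutes with zero. The seed value and the choice to clip at zero are both assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)  # the seed value is arbitrary; any integer works

height = rng.normal(loc=177, scale=7.59, size=20).astype(int)  # whole centimeters

minutes = rng.normal(loc=400, scale=350, size=20)
minutes = np.clip(minutes, 0, None).astype(int)  # assumption: negative draws become 0

# A second generator with the same seed reproduces the same first draw
rng_again = np.random.default_rng(seed=2024)
height_again = rng_again.normal(loc=177, scale=7.59, size=20).astype(int)

print(height)
print(minutes)
print(np.array_equal(height, height_again))
```

Setting a seed is useful in teaching materials because every reader who runs the cell sees the same "random" patients.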