# Dataframes with Pandas

Dataframes are the common format for social science data, whether you are using Excel, R, or Python. You can visualize dataframes as a table full of values where, generally, each column is a variable and each row is a unit of observation. You will become familiar with dataframes by creating some imaginary data before moving to real datasets from the American Community Survey. 

Some project management tips while you write and run code for this chapter: we suggest making a sub-directory for this chapter inside your Python for Social Science folder and saving your JupyterLab notebook for chapter 2 inside it. Create a `data/` folder for the dataset you will be downloading. We will make this same recommendation in each subsequent chapter of this book.

## Vectors

The smallest bit of data is the individual observation, and it is not very useful on its own: _"Person X's height is 177 centimeters."_ Multiple observations like this, however, are more useful. If a medical researcher recorded the height of ten patients, we would have a vector of data. A vector is a one-dimensional array of values; it has one dimension in the sense that it holds a single variable measured over _x_ observations. Be aware that the terms used to describe a vector change across programming languages. In Python, a one-dimensional series of data points is often called an __array__. Although the concept and shape of the data are essentially the same, Python offers several other containers for one-dimensional data, such as lists, tuples, and sets, each with distinct uses and quirks. In this book, we will consider vectors and arrays as synonymous.

In the cell below, we record the heights of twenty patients inside curly braces, which creates a Python set. Note that we created all of these observations at random using imaginary averages.

In [57]:
patient_height = {177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173}
patient_height

{163, 168, 170, 173, 174, 176, 177, 181, 182, 183, 184, 186, 191, 192}

Notice how the notebook displays the values of our vector from smallest to largest, even though we entered the heights in a specific order. This is because a Python set is an unordered collection: it does not remember the order in which values were entered (third-party packages such as `ordered-set` add that behavior). Using the `print()` function gives us a different order altogether:

In [58]:
print(patient_height)

{192, 163, 168, 170, 173, 174, 176, 177, 181, 182, 183, 184, 186, 191}


This is a serious problem for social science data because we expect the order of observations in a vector to be consistent with the vectors of other variables, say patient height and patient weight, because these should reflect the measurements of a unique person. Notice as well that only 14 of the twenty observations of `patient_height` remain: a set stores each distinct value once, so repeated heights such as 170 and 177 were silently dropped. Our solution is to use the Pandas package in Python, which was designed to create, manipulate, and inspect tables of vectors of the same length. In other words: dataframes.
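As a quick check of why the set lost observations, the sketch below stores the same values in a plain Python list and in a set. The shortened list of heights and the variable names are ours, chosen only for illustration.

```python
# A few heights taken from the example above, including repeated values
heights_list = [177, 174, 170, 183, 170, 177]

# A set keeps each distinct value once and does not track entry order
heights_set = set(heights_list)

print(len(heights_list))  # 6: the list keeps every observation
print(len(heights_set))   # 4: the repeated 170 and 177 are stored once
print(heights_list[0])    # 177: a list can be indexed by position
```

A list already fixes the ordering and duplicate problems, but it offers none of the labeling and table-building tools that pandas provides.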

## Pandas Series
A single set of observations for patient height is also not very useful. Researchers would also be interested in further physical measurements like weight, daily calories, minutes of physical activity, etc. You can create each of these individually with the `pd.Series()` constructor in pandas.

In [62]:
import pandas as pd

patient_height = pd.Series([177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173])

patient_weight = pd.Series([99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 93.53, 97.3, 51.95, 109.97, 89.79, 85.17])

weekly_activity = pd.Series([153, 541, 373, 246, 312, 123, 313, 295, 139, 328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121])

daily_calories = pd.Series([2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 2110, 2421, 2522, 1815])

print(patient_weight)

0      99.38
1      69.31
2      58.40
3      75.42
4      86.91
5      66.39
6      74.57
7      87.31
8      73.10
9      82.97
10     85.89
11     79.16
12     59.82
13     78.90
14     93.53
15     97.30
16     51.95
17    109.97
18     89.79
19     85.17
dtype: float64


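Now our vectors print in the order we entered them, which must match the order of the individuals observed, and repeated values are kept. The left-hand column is the index that pandas assigns to each observation, starting at 0. Pandas also tells us that the vector contains "float64" values, which simply means decimal numbers stored as 64-bit floating-point values, precise to roughly 15 to 17 significant digits.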
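A short sketch of how one might confirm the order and the data type on a smaller Series. The five weights are the first five values entered above; the variable name `weights` is ours.

```python
import numpy as np
import pandas as pd

weights = pd.Series([99.38, 69.31, 58.4, 75.42, 86.91])

print(len(weights))     # 5: every observation is kept
print(weights.iloc[0])  # 99.38: the first value entered stays first
print(weights.dtype)    # float64

# Number of decimal digits a float64 can represent reliably
print(np.finfo(np.float64).precision)  # 15
```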

## Pandas Dataframes

To put it very simply, combining two or more vectors makes a dataframe, with some important caveats on data quality. First, the vectors must be the same length, or else the number of observations would not match. Second, we need to be sure that the data in different vectors correspond to the same unit of observation: that height<sub>i</sub> and weight<sub>i</sub> are both measurements of the same person.
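The first caveat can be checked before combining anything. The sketch below uses two deliberately mismatched toy Series (our own shortened values) to show that `pd.concat()` does not raise an error when lengths differ; it pads the shorter column with missing values, so comparing lengths yourself is a sensible precaution.

```python
import pandas as pd

# Toy data: three heights but only two weights
height = pd.Series([177, 174, 170], name="height")
weight = pd.Series([99.38, 69.31], name="weight")

# Compare lengths explicitly before combining
same_length = len(height) == len(weight)
print(same_length)  # False

# concat() aligns on the index and fills the gap with NaN
combined = pd.concat([height, weight], axis=1)
print(combined)

missing_weights = int(combined["weight"].isna().sum())
print(missing_weights)  # 1
```

The second caveat, that row _i_ of each vector describes the same person, cannot be verified by code alone; it depends on how the data were recorded.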

Now that we have some vectors for our fictitious 20-person study, how would we go about analyzing the data? Some statistical analyses can be performed with just the arrays we named above: `patient_height`, `patient_weight`, `weekly_activity`, and `daily_calories`. But for most data analysis applications, we need to put these arrays together into a table where each vector is a vertical column. We can do this with the `concat()` function in pandas.

In [69]:
df1 = pd.concat([patient_height, patient_weight, weekly_activity, daily_calories], axis=1)
df1

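| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 177 | 99.38 | 153 | 2418 |
| 1 | 174 | 69.31 | 541 | 2830 |
| 2 | 170 | 58.4 | 373 | 2113 |
| 3 | 183 | 75.42 | 246 | 2022 |
| 4 | 168 | 86.91 | 312 | 2255 |
| 5 | 182 | 66.39 | 123 | 2555 |
| 6 | 163 | 74.57 | 313 | 1945 |
| 7 | 191 | 87.31 | 295 | 2381 |
| 8 | 177 | 73.1 | 139 | 2379 |
| 9 | 176 | 82.97 | 328 | 2178 |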


Note that we used the argument `axis=1` at the end of the `concat()` function to merge the vectors horizontally. Without this argument, `concat()` would have stacked all the observations into a single column 80 cells long. Now, this table is pretty bare-bones, but it is a dataframe. The problem is that it has no variable names to make the data legible. There are several ways to include variable names. Observe how the different steps below all result in the exact same dataframe using:
- The `name=" "` argument within the `pd.Series()` function.
- The `x.name = " "` attribute assignment to name a vector, whether or not it already has a name.
- The `pd.DataFrame()` function to create a dataframe with named columns from the start.

In [74]:
# We can assign a name to the pandas Series types
patient_height = pd.Series([177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 
                            173, 186, 174, 168, 184, 170, 170, 192, 181, 173], name = "height")
patient_weight = pd.Series([99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 
                            87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 
                            93.53, 97.3, 51.95, 109.97, 89.79, 85.17], name = "weight")
weekly_activity = pd.Series([153, 541, 373, 246, 312, 123, 313, 295, 139, 
                             328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121], name = "weekly activity")
daily_calories = pd.Series([2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 
                            2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 
                            2110, 2421, 2522, 1815], name = "daily calories")
df1_1 = pd.concat([patient_height, patient_weight, weekly_activity, daily_calories], axis=1)
df1_1

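| | height | weight | weekly activity | daily calories |
|---|---|---|---|---|
| 0 | 177 | 99.38 | 153 | 2418 |
| 1 | 174 | 69.31 | 541 | 2830 |
| 2 | 170 | 58.4 | 373 | 2113 |
| 3 | 183 | 75.42 | 246 | 2022 |
| 4 | 168 | 86.91 | 312 | 2255 |
| 5 | 182 | 66.39 | 123 | 2555 |
| 6 | 163 | 74.57 | 313 | 1945 |
| 7 | 191 | 87.31 | 295 | 2381 |
| 8 | 177 | 73.1 | 139 | 2379 |
| 9 | 176 | 82.97 | 328 | 2178 |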


In [76]:
# We can assign a name to the pandas Series after creating them
patient_height = pd.Series([177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173])
patient_height.name = "height"

patient_weight = pd.Series([99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 93.53, 97.3, 51.95, 109.97, 89.79, 85.17])
patient_weight.name = "weight"

weekly_activity = pd.Series([153, 541, 373, 246, 312, 123, 313, 295, 139, 328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121])
weekly_activity.name = "weekly activity"

daily_calories = pd.Series([2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 2110, 2421, 2522, 1815])
daily_calories.name = "daily calories"

df1_2 = pd.concat([patient_height, patient_weight, weekly_activity, daily_calories], axis=1)
df1_2

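| | height | weight | weekly activity | daily calories |
|---|---|---|---|---|
| 0 | 177 | 99.38 | 153 | 2418 |
| 1 | 174 | 69.31 | 541 | 2830 |
| 2 | 170 | 58.4 | 373 | 2113 |
| 3 | 183 | 75.42 | 246 | 2022 |
| 4 | 168 | 86.91 | 312 | 2255 |
| 5 | 182 | 66.39 | 123 | 2555 |
| 6 | 163 | 74.57 | 313 | 1945 |
| 7 | 191 | 87.31 | 295 | 2381 |
| 8 | 177 | 73.1 | 139 | 2379 |
| 9 | 176 | 82.97 | 328 | 2178 |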


In [78]:
# Or create a dataframe wholesale with pd.DataFrame()
df1_3 = pd.DataFrame(
    {
        "height": [177, 174, 170, 183, 168, 182, 163, 191, 177, 176, 173, 186, 174, 168, 184, 170, 170, 192, 181, 173],
        "weight": [99.38, 69.31, 58.4, 75.42, 86.91, 66.39, 74.57, 87.31, 73.1, 82.97, 85.89, 79.16, 59.82, 78.9, 93.53, 97.3, 51.95, 109.97, 89.79, 85.17],
        "weekly activity": [153, 541, 373, 246, 312, 123, 313, 295, 139, 328, 133, 191, 112, 150, 172, 401, 460, 395, 196, 121],
        "daily calories": [2418, 2830, 2113, 2022, 2255, 2555, 1945, 2381, 2379, 2178, 2164, 1652, 2448, 1922, 2006, 2391, 2110, 2421, 2522, 1815]
    }
)

df1_3

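| | height | weight | weekly activity | daily calories |
|---|---|---|---|---|
| 0 | 177 | 99.38 | 153 | 2418 |
| 1 | 174 | 69.31 | 541 | 2830 |
| 2 | 170 | 58.4 | 373 | 2113 |
| 3 | 183 | 75.42 | 246 | 2022 |
| 4 | 168 | 86.91 | 312 | 2255 |
| 5 | 182 | 66.39 | 123 | 2555 |
| 6 | 163 | 74.57 | 313 | 1945 |
| 7 | 191 | 87.31 | 295 | 2381 |
| 8 | 177 | 73.1 | 139 | 2379 |
| 9 | 176 | 82.97 | 328 | 2178 |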
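One way to confirm that the three approaches agree is the `DataFrame.equals()` method, which checks that two dataframes have the same shape, elements, and column data types. The sketch uses only the first three patients and two variables to stay short; the names `by_argument`, `by_attribute`, and `by_dict` are ours.

```python
import pandas as pd

heights = [177, 174, 170]
weights = [99.38, 69.31, 58.4]

# 1. Name each Series when it is created
by_argument = pd.concat(
    [pd.Series(heights, name="height"), pd.Series(weights, name="weight")],
    axis=1,
)

# 2. Name each Series after it is created
h = pd.Series(heights)
h.name = "height"
w = pd.Series(weights)
w.name = "weight"
by_attribute = pd.concat([h, w], axis=1)

# 3. Build the dataframe directly from a dictionary
by_dict = pd.DataFrame({"height": heights, "weight": weights})

print(by_argument.equals(by_attribute))  # True
print(by_argument.equals(by_dict))       # True

# Without axis=1, concat() stacks the Series into one long column
stacked = pd.concat([h, w])
print(len(stacked))  # 6
```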


In [64]:
# Random number generation for imaginary measurements
import numpy as np
from IPython.core.interactiveshell import InteractiveShell

# loc = mean, scale = standard deviation, size = number of observations
np.random.normal(loc=177, scale=7.59, size=20).astype(int)
np.random.normal(loc=2142, scale=267, size=20).astype(int)
# Variants rounded to two decimals instead of truncated to integers:
# np.round(np.random.normal(loc=177, scale=7.59, size=20), decimals=2)
# np.round(np.random.normal(loc=81, scale=12, size=20), decimals=2)

# Display the result of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = 'all'


array([167, 188, 180, 175, 192, 168, 175, 176, 164, 182, 174, 180, 178,
       190, 175, 168, 160, 179, 171, 183])

array([2311, 2044, 2020, 2265, 1993, 2184, 2382, 1896, 1701, 2406, 2476,
       2087, 2142, 2562, 2349, 2016, 2256, 1696, 1834, 1776])
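Because these draws change on every run, the arrays printed here will not match the values used earlier in the chapter. A minimal sketch of a reproducible alternative uses NumPy's `default_rng()` generator with a fixed seed; the seed value 42 and the variable names are arbitrary choices of ours, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen for illustration

# loc = mean, scale = standard deviation, size = number of observations
heights = rng.normal(loc=177, scale=7.59, size=20).astype(int)
calories = rng.normal(loc=2142, scale=267, size=20).astype(int)

# Re-creating the generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=42)
heights_again = rng_again.normal(loc=177, scale=7.59, size=20).astype(int)

print(len(heights), len(calories))
print(np.array_equal(heights, heights_again))
```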

# To-do
- Explain tabular data
  - Pandas, load and basics
  - Vectors: make a vector. inspect it. then make two vectors.
  - Join the vectors into a dataframe
- Data selection for ACS DP02:
    - task1 download ACS DP02
    - task2 import ACSDP1Y2022.DP02-Data.csv
    - task3 select relevant data columns
        - variables: geoid, name,
        - DP02_0001E total households
        - DP02_0018E population in households
        - DP02_0053PE pop over three enrolled in school college or grad school
        - DP02_0054PE	Percent!!SCHOOL ENROLLMENT!!Population 3 years and over enrolled in school!!Nursery school, preschool
        - DP02_0055PE	Percent!!SCHOOL ENROLLMENT!!Population 3 years and over enrolled in school!!Kindergarten
        - DP02_0056PE	Percent!!SCHOOL ENROLLMENT!!Population 3 years and over enrolled in school!!Elementary school (grades 1-8)
        - DP02_0057PE	Percent!!SCHOOL ENROLLMENT!!Population 3 years and over enrolled in school!!High school (grades 9-12)
        - DP02_0058PE	Percent!!SCHOOL ENROLLMENT!!Population 3 years and over enrolled in school!!College or graduate school
      - task 4 