# CSS 201.5 - CSS MA Bootcamp

## Lecture 01 - Week 02

### Advanced Data Analysis in Python

## Recap

We learned:

1. `pandas`
2. Loading CSV data
3. Checking the dataset
4. Creating plots
5. Creating `plotly` interactive plots

Great work! Any questions about this material?

## Today

This week, we start with data wrangling.

This is most of what you will do in your jobs, so pay close attention to these parts.

They are also quite tedious, which tempts us not to pay attention at all.

Today we will cover ***categorical variables***.

# Data Wrangling

## Data Wrangling

Data wrangling, also known as data munging or data preprocessing, refers to the process of cleaning, transforming, and preparing raw data into a structured format suitable for analysis. Here are some common tasks involved in data wrangling:

1. Data cleaning: Handling missing values, dealing with outliers, correcting inconsistent or inaccurate data.

1. Data integration: When working with multiple data sources, data integration involves combining data from different sources into a single dataset.

1. Data transformation: Data transformation involves converting data into a suitable format for analysis.

1. Data reduction: In some cases, the original dataset may be too large or contain unnecessary variables. Data reduction involves selecting relevant variables and reducing the size of the dataset without losing critical information.

1. Handling inconsistencies: Addressing inconsistencies in the data, such as inconsistent formatting, inconsistent units of measurement, or inconsistent categorical values.
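
To make the list above concrete, here is a minimal sketch that touches each task once. The two small tables (`people` and `regions`) and their values are hypothetical toy data made up for illustration; they are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
people = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'age': [34, None, 51, 29],
    'status': ['Married', 'married ', 'SINGLE', 'single'],
})
regions = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'region': ['West', 'South', 'West', 'East'],
})

# Handling inconsistencies: harmonize the spelling of the labels
people['status'] = people['status'].str.strip().str.lower()

# Cleaning: fill the missing age with the median of the observed ages
people['age'] = people['age'].fillna(people['age'].median())

# Integration: combine the two sources on the shared key
merged = people.merge(regions, on='id', how='left')

# Transformation: store the cleaned labels as a categorical variable
merged['status'] = merged['status'].astype('category')

# Reduction: keep only the columns needed for the analysis
small = merged[['age', 'status', 'region']]
print(small)
```

Filling with the median is only one possible choice for missing values; whether it is appropriate depends on the analysis.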


Data wrangling is one of the most important skills you are going to learn as a Computational Social Scientist. It is a crucial step in the data analysis pipeline, as the quality of the final analysis depends on the quality of the prepared data.

# Wrangling Categorical Variables

## Wrangling Categorical Variables

- **Categorical variable**: Variable that represents qualitative data, often organized into discrete categories or groups.

- It can take on a limited number of distinct values or levels, where each value represents a particular group or category.

## Wrangling Categorical Variables

- Categorical variables are commonly used to classify or describe characteristics, attributes, or qualities.

- Examples of categorical variables:
    + *Gender* (male, female, and others)
    + *Marital status* (single, married, divorced, etc.)
    + *Educational level* (high school, college, graduate, etc.)

## Wrangling Categorical Variables

- Categorical variables can be further classified into:
    + **Nominal**: Categories have no inherent order.
    + **Ordinal**: Categories have a specific order or ranking.

## Wrangling Categorical Variables

In [1]:
## Loading a couple of friends in here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
from plotly import express as px
from plotly.subplots import make_subplots

## Wrangling Categorical Variables

In [2]:
# Loading the General Social Survey
dat = pd.read_csv('gss.csv')
dat.head()

Unnamed: 0,region,income,happy,age,finrela,marital,degree,health,wrkstat,partyid,polviews,sex,year
0,E. NOR. CENTRAL,$10000 - 14999,NOT TOO HAPPY,54.0,AVERAGE,MARRIED,LT HIGH SCHOOL,FAIR,WORKING FULLTIME,OTHER PARTY,,MALE,1973
1,E. NOR. CENTRAL,$7000 TO 7999,VERY HAPPY,51.0,AVERAGE,MARRIED,LT HIGH SCHOOL,GOOD,KEEPING HOUSE,NOT STR DEMOCRAT,,FEMALE,1973
2,E. NOR. CENTRAL,$10000 - 14999,PRETTY HAPPY,36.0,AVERAGE,MARRIED,LT HIGH SCHOOL,EXCELLENT,WORKING FULLTIME,IND NEAR REP,,FEMALE,1973
3,E. NOR. CENTRAL,$10000 - 14999,PRETTY HAPPY,32.0,AVERAGE,MARRIED,HIGH SCHOOL,EXCELLENT,WORKING FULLTIME,NOT STR DEMOCRAT,,MALE,1973
4,E. NOR. CENTRAL,$10000 - 14999,PRETTY HAPPY,54.0,AVERAGE,MARRIED,LT HIGH SCHOOL,GOOD,KEEPING HOUSE,IND NEAR REP,,FEMALE,1973


## Wrangling Categorical Variables

**Exercise:** Explore the data.

1. Make at least two plots
2. Generate at least two descriptive stats
3. What about missing data? Tell me something about it and, if there is any, do something about it.

Feel free to ask me what a variable means if it is not clear.

In [3]:
## Your code here

## Wrangling Categorical Variables

The first step is to explore the variables. Two methods are useful for that:

1. `.describe()`
2. `.value_counts()`

In [4]:
# Health status
print(dat.health.describe(), end = '\n\n\n')
print(dat.health.value_counts())

count     40273
unique        6
top        GOOD
freq      17747
Name: health, dtype: object


GOOD         17747
EXCELLENT    12135
FAIR          7402
POOR          2224
IAP            763
DK               2
Name: health, dtype: int64
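
Two options of `.value_counts()` are often handy at this stage: `dropna = False` keeps missing values in the table, and `normalize = True` returns shares instead of counts. The sketch below uses a hypothetical toy series (`health`) rather than the GSS column.

```python
import pandas as pd

# Hypothetical toy series standing in for a column such as `health`
health = pd.Series(['GOOD', 'GOOD', 'FAIR', None, 'EXCELLENT', 'GOOD'])

# Absolute counts, keeping the missing value visible
counts = health.value_counts(dropna=False)

# Shares instead of counts (missing values are excluded by default)
shares = health.value_counts(normalize=True)

print(counts)
print(shares)
```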


## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `polviews`.

In [5]:
## Your code here

## Wrangling Categorical Variables

The second thing we can do is to identify which data types we have and which columns hold categorical data.

We should print the data types:

In [6]:
dat.dtypes

region       object
income       object
happy        object
age         float64
finrela      object
marital      object
degree       object
health       object
wrkstat      object
partyid      object
polviews     object
sex          object
year          int64
dtype: object

## Wrangling Categorical Variables

Data types can be:

- `int`: Numerical integer data
- `float`: Numerical continuous data
- `object`: Anything else, most often text
- others (for example `bool`, `datetime64`, and `category`)

We want our categorical variables to be stored as `category`.

In [7]:
dat.happy.dtype

dtype('O')

In [8]:
dat.happy = dat.happy.astype('category')

## Wrangling Categorical Variables

And when we check it again:

In [9]:
dat.dtypes

region        object
income        object
happy       category
age          float64
finrela       object
marital       object
degree        object
health        object
wrkstat       object
partyid       object
polviews      object
sex           object
year           int64
dtype: object

In [10]:
dat.happy.dtype

CategoricalDtype(categories=['DK', 'NOT TOO HAPPY', 'PRETTY HAPPY', 'VERY HAPPY'], ordered=False)
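
When several columns need converting, `astype` also accepts a dictionary. The sketch below uses a hypothetical toy DataFrame (`toy`) and also peeks at what a categorical stores: the list of labels (`.cat.categories`) and one integer code per row (`.cat.codes`).

```python
import pandas as pd

# Hypothetical toy data with two text columns
toy = pd.DataFrame({
    'happy': ['VERY HAPPY', 'PRETTY HAPPY', 'VERY HAPPY'],
    'sex': ['MALE', 'FEMALE', 'FEMALE'],
    'age': [54.0, 51.0, 36.0],
})

# Convert several columns in one call by passing a dict to astype
toy = toy.astype({'happy': 'category', 'sex': 'category'})

print(toy.dtypes)
# The labels pandas found, sorted alphabetically by default
print(toy['happy'].cat.categories)
# The integer codes that are stored in place of the strings
print(toy['happy'].cat.codes)
```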

## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`.

In [11]:
## Your code here

## Wrangling Categorical Variables

There are two ways to create categorical data in pandas.

- By the way, in pandas, the variables (data columns) are called *Series*.

Way 1:

In [12]:
myvar = ['Happy', 'Happy', 'Sad', 'Happy', 'Sad', 'Meh']
myseries = pd.Series(myvar, dtype = 'category')
myseries

0    Happy
1    Happy
2      Sad
3    Happy
4      Sad
5      Meh
dtype: category
Categories (3, object): ['Happy', 'Meh', 'Sad']

Way 2 (better, because we choose the categories and their order):

In [13]:
myseries2 = pd.Categorical(myvar, 
                           categories = ['Sad', 'Meh', 'Happy'],
                           ordered = True)
myseries2

['Happy', 'Happy', 'Sad', 'Happy', 'Sad', 'Meh']
Categories (3, object): ['Sad' < 'Meh' < 'Happy']
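
Note that `pd.Categorical` returns a categorical array rather than a Series. A small sketch of how one might wrap it in a Series and use the declared order; the names `mood_cat` and `mood` are made up for this example.

```python
import pandas as pd

myvar = ['Happy', 'Happy', 'Sad', 'Happy', 'Sad', 'Meh']

# pd.Categorical returns a Categorical array, not a Series
mood_cat = pd.Categorical(myvar,
                          categories=['Sad', 'Meh', 'Happy'],
                          ordered=True)

# Wrapping it in pd.Series gives access to the usual Series methods
mood = pd.Series(mood_cat)

# With an ordered categorical, min, max, and comparisons follow the
# order we declared rather than alphabetical order
print(mood.min(), mood.max())
print((mood > 'Sad').sum())
```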

## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`.

In [14]:
## Your code here

## Wrangling Categorical Variables

Why store variables as categorical?

1. Built-in methods for working with the data
1. We can specify the order of the categories
1. Very important but neglected: memory savings

In [15]:
vobject = dat.polviews
vobject.nbytes

427792

In [16]:
vobject = dat.polviews.astype('category')
vobject.nbytes

53538
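
The saving is likely even larger than `.nbytes` suggests. For an `object` column, the 427792 bytes above equal 8 bytes times 53474 rows, which indicates that only the references to the strings are counted, not the strings themselves. A hedged sketch of a fuller comparison with `.memory_usage(deep = True)`, on a hypothetical toy column (`labels`):

```python
import pandas as pd

# Hypothetical toy column: a few labels repeated many times
labels = pd.Series(['MODERATE', 'LIBERAL', 'CONSERVATIVE'] * 10_000,
                   dtype='object')

# deep=True also counts the memory taken by the strings themselves
as_object = labels.memory_usage(deep=True, index=False)
as_category = labels.astype('category').memory_usage(deep=True, index=False)

print(as_object, as_category)
```

The exact numbers depend on the Python and pandas versions, so the example prints them rather than stating them.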

## Wrangling Categorical Variables

We can also specify the data types up front, when loading the data:

In [17]:
data_types = {'happy': 'category'}
dat = pd.read_csv('gss.csv',
                 dtype = data_types)
dat.dtypes

region        object
income        object
happy       category
age          float64
finrela       object
marital       object
degree        object
health        object
wrkstat       object
partyid       object
polviews      object
sex           object
year           int64
dtype: object

Setting the data type while loading saves a lot of memory!
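
The `dtype` argument also accepts a `CategoricalDtype`, so the categories and their order can be declared at load time as well. This is a sketch under assumptions: the in-memory CSV below is hypothetical toy data standing in for `gss.csv`, and `happy_type` is a name chosen for this example.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical in-memory CSV standing in for 'gss.csv'
csv_text = io.StringIO(
    "happy,age\n"
    "VERY HAPPY,54\n"
    "NOT TOO HAPPY,51\n"
    "PRETTY HAPPY,36\n"
)

# Declare the levels and their order up front
happy_type = CategoricalDtype(
    categories=['NOT TOO HAPPY', 'PRETTY HAPPY', 'VERY HAPPY'],
    ordered=True)

toy = pd.read_csv(csv_text, dtype={'happy': happy_type})
print(toy.happy.dtype)
```

Values in the file that are not among the declared categories would be read as missing, so it is worth checking the labels first.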

## Wrangling Categorical Variables

We have used `groupby` before. Let's do it again, now with categorical data.

In [18]:
## Group by (method chaining)
(dat.groupby('happy')
    .age
    .mean())

happy
DK               51.500000
NOT TOO HAPPY    46.130420
PRETTY HAPPY     44.632033
VERY HAPPY       46.953361
Name: age, dtype: float64

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree` and `age`.

In [19]:
## Your code here

## Wrangling Categorical Variables

We can do that with multiple categories.

In [20]:
## Group by (method chaining)
(dat.groupby(['happy', 'health'])
    .age
    .mean())

happy          health   
DK             DK                 NaN
               EXCELLENT          NaN
               FAIR         51.500000
               GOOD               NaN
               IAP                NaN
               POOR               NaN
NOT TOO HAPPY  DK           53.500000
               EXCELLENT    40.870445
               FAIR         47.171703
               GOOD         42.869019
               IAP          49.242718
               POOR         56.920000
PRETTY HAPPY   DK                 NaN
               EXCELLENT    39.277410
               FAIR         51.043950
               GOOD         43.226660
               IAP          47.133772
               POOR         60.323988
VERY HAPPY     DK                 NaN
               EXCELLENT    43.093289
               FAIR         55.592857
               GOOD         47.616559
               IAP          50.044554
               POOR         61.384401
Name: age, dtype: float64

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree`, `happy`, and `age`.

In [21]:
## Your code here

## Wrangling Categorical Variables

The `NaN` values appear because some combinations of categories have no observations.

We check the number of cases within each combination using the `.size()` method:

In [22]:
## Groupby + size
(dat.groupby(['happy', 'health'])
    .size())

happy          health   
DK             DK              0
               EXCELLENT       0
               FAIR            2
               GOOD            0
               IAP             0
               POOR            0
NOT TOO HAPPY  DK              2
               EXCELLENT     745
               FAIR         1464
               GOOD         1675
               IAP           103
               POOR          730
PRETTY HAPPY   DK              0
               EXCELLENT    5290
               FAIR         3969
               GOOD         9971
               IAP           456
               POOR          965
VERY HAPPY     DK              0
               EXCELLENT    5171
               FAIR         1402
               GOOD         4665
               IAP           202
               POOR          361
dtype: int64

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree` and `happy`.

In [23]:
## Your code here

## Wrangling Categorical Variables

To hide the empty combinations, we can chain `.dropna()` after the mean:

In [24]:
## Groupby + dropna
(dat.groupby(['happy', 'health'])
    .age
    .mean()
    .dropna())

happy          health   
DK             FAIR         51.500000
NOT TOO HAPPY  DK           53.500000
               EXCELLENT    40.870445
               FAIR         47.171703
               GOOD         42.869019
               IAP          49.242718
               POOR         56.920000
PRETTY HAPPY   EXCELLENT    39.277410
               FAIR         51.043950
               GOOD         43.226660
               IAP          47.133772
               POOR         60.323988
VERY HAPPY     EXCELLENT    43.093289
               FAIR         55.592857
               GOOD         47.616559
               IAP          50.044554
               POOR         61.384401
Name: age, dtype: float64
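
An alternative worth knowing is the `observed` argument of `groupby`, which controls whether category combinations with no rows are reported at all. The sketch below uses a hypothetical toy DataFrame (`toy`); the default value of `observed` has changed across pandas versions, so passing it explicitly is the safer habit.

```python
import pandas as pd

# Hypothetical toy data; both grouping columns are categorical
toy = pd.DataFrame({
    'happy': pd.Categorical(['VERY HAPPY', 'VERY HAPPY', 'DK'],
                            categories=['DK', 'PRETTY HAPPY', 'VERY HAPPY']),
    'health': pd.Categorical(['GOOD', 'FAIR', 'FAIR'],
                             categories=['FAIR', 'GOOD']),
    'age': [40.0, 60.0, 50.0],
})

# observed=False keeps every combination of levels, even empty ones
all_combos = toy.groupby(['happy', 'health'], observed=False).age.mean()

# observed=True keeps only the combinations present in the data
seen = toy.groupby(['happy', 'health'], observed=True).age.mean()

print(len(all_combos), len(seen))
```

Unlike `.dropna()`, `observed = True` keeps a group that exists but has only missing ages, so the two approaches are not always interchangeable.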

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree`, `happy`, and `age`, removing the missing values.

In [25]:
## Your code here

## Wrangling Categorical Variables

The third step is to set the parameters of the categorical variables in a way that helps the analysis.

Let us look at the variable `polviews`. We start by converting it to categorical and seeing what we have there:

In [26]:
dat.polviews = dat.polviews.astype('category')
(dat.polviews
    .value_counts(dropna = False))

MODERATE                17781
SLGHTLY CONSERVATIVE     7423
NaN                      7411
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
DK                         61
Name: polviews, dtype: int64

## Wrangling Categorical Variables

To see some more details, let us pull the series out into its own variable:

In [27]:
polv = dat.polviews
polv

0                         NaN
1                         NaN
2                         NaN
3                         NaN
4                         NaN
                 ...         
53469    SLGHTLY CONSERVATIVE
53470    EXTRMLY CONSERVATIVE
53471        SLIGHTLY LIBERAL
53472        SLIGHTLY LIBERAL
53473                MODERATE
Name: polviews, Length: 53474, dtype: category
Categories (8, object): ['CONSERVATIVE', 'DK', 'EXTREMELY LIBERAL', 'EXTRMLY CONSERVATIVE', 'LIBERAL', 'MODERATE', 'SLGHTLY CONSERVATIVE', 'SLIGHTLY LIBERAL']

## Wrangling Categorical Variables

### Set Categories

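`.cat.set_categories()` replaces the list of categories. Any value that is not in the new list, such as `DK` here, becomes `NaN`, which is why the `NaN` count rises from 7411 to 7472:

In [28]: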
polv2 = polv.cat.set_categories(
    new_categories = ['EXTREMELY LIBERAL', 
                      'LIBERAL', 
                      'SLIGHTLY LIBERAL', 
                      'MODERATE', 
                      'SLGHTLY CONSERVATIVE', 
                      'CONSERVATIVE', 
                      'EXTRMLY CONSERVATIVE']
)
print(polv2.value_counts(dropna = False), end = '\n\n\n')
print(polv.value_counts(dropna = False), end = '\n\n\n')

MODERATE                17781
NaN                      7472
SLGHTLY CONSERVATIVE     7423
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
Name: polviews, dtype: int64


MODERATE                17781
SLGHTLY CONSERVATIVE     7423
NaN                      7411
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
DK                         61
Name: polviews, dtype: int64
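
The same effect on a small scale, as a sketch with a hypothetical toy series (`views`): leaving a level out of `set_categories` turns its rows into missing values.

```python
import pandas as pd

# Hypothetical toy series with a 'DK' (don't know) level
views = pd.Series(['LIBERAL', 'DK', 'MODERATE', 'DK', 'CONSERVATIVE'],
                  dtype='category')

# 'DK' is left out of the new list, so its rows become missing
views2 = views.cat.set_categories(['LIBERAL', 'MODERATE', 'CONSERVATIVE'])

print(views.isna().sum(), views2.isna().sum())
print(list(views2.cat.categories))
```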




## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`.

In [29]:
## Your code here

## Wrangling Categorical Variables

### Set Categories

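Passing `ordered = True` also records the order of the categories. With `sort = False`, `value_counts` then lists them in that order:

In [30]: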
polv3 = polv.cat.set_categories(
    new_categories = ['EXTREMELY LIBERAL', 'LIBERAL', 'SLIGHTLY LIBERAL', 'MODERATE', 'SLGHTLY CONSERVATIVE', 'CONSERVATIVE', 'EXTRMLY CONSERVATIVE'],
    ordered = True
)
print(polv3.value_counts(dropna = False, sort = False), end = '\n\n\n')
print(polv.value_counts(dropna = False), end = '\n\n\n')

EXTREMELY LIBERAL        1249
LIBERAL                  5338
SLIGHTLY LIBERAL         5973
MODERATE                17781
SLGHTLY CONSERVATIVE     7423
CONSERVATIVE             6800
EXTRMLY CONSERVATIVE     1438
NaN                      7472
Name: polviews, dtype: int64


MODERATE                17781
SLGHTLY CONSERVATIVE     7423
NaN                      7411
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
DK                         61
Name: polviews, dtype: int64
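
Once the order is recorded, sorting and comparisons use it. A minimal sketch on a hypothetical toy series (`views`) with only three levels:

```python
import pandas as pd

levels = ['LIBERAL', 'MODERATE', 'CONSERVATIVE']

# Hypothetical toy series, stored as an ordered categorical
views = pd.Series(['MODERATE', 'CONSERVATIVE', 'LIBERAL', 'MODERATE'])
views = views.astype('category').cat.set_categories(levels, ordered=True)

# Sorting follows the declared order, not the alphabet
print(views.sort_values().tolist())

# Comparisons against a level work because the order is known
right_of_liberal = views[views > 'LIBERAL']
print(len(right_of_liberal))
```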




## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`.

In [31]:
## Your code here

## Wrangling Categorical Variables

### Add and Remove Categories

We can also add and remove categories. If we add categories that are not there, they will have a zero count:

In [32]:
polv4 = polv.cat.add_categories(
    new_categories = [
        'very very liberal',
        'very very conservative'],
)
print(polv4.value_counts(dropna = False, sort = False))

CONSERVATIVE               6800
DK                           61
EXTREMELY LIBERAL          1249
EXTRMLY CONSERVATIVE       1438
LIBERAL                    5338
MODERATE                  17781
SLGHTLY CONSERVATIVE       7423
SLIGHTLY LIBERAL           5973
very very liberal             0
very very conservative        0
NaN                        7411
Name: polviews, dtype: int64
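
One practical reason to add a category is that pandas refuses to assign a label that is not among the declared categories. The sketch below uses a hypothetical toy series (`views`) and a made-up label (`'OTHER'`); the exact exception type has differed between pandas versions, so the example catches both candidates.

```python
import pandas as pd

# Hypothetical toy series
views = pd.Series(['LIBERAL', 'MODERATE', 'LIBERAL'], dtype='category')

# Assigning a label that is not a declared category is rejected
try:
    views.iloc[0] = 'OTHER'
    rejected = False
except (TypeError, ValueError):
    rejected = True

# After adding the category, the same assignment goes through
views = views.cat.add_categories(['OTHER'])
views.iloc[0] = 'OTHER'

print(rejected, views.tolist())
```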


## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`, creating one imaginary degree.

In [33]:
## Your code here

## Wrangling Categorical Variables

### Add and Remove Categories

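Removing a category that has observations turns those observations into `NaN`. Here the 17781 `MODERATE` rows join the 7411 missing values, giving 25192:

In [34]: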
polv5 = polv4.cat.remove_categories(
    removals = ['MODERATE']
)
print(polv5.value_counts(dropna = False, sort = False))

CONSERVATIVE               6800
DK                           61
EXTREMELY LIBERAL          1249
EXTRMLY CONSERVATIVE       1438
LIBERAL                    5338
SLGHTLY CONSERVATIVE       7423
SLIGHTLY LIBERAL           5973
very very liberal             0
very very conservative        0
NaN                       25192
Name: polviews, dtype: int64
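
If the goal is only to get rid of levels with a zero count, `.cat.remove_unused_categories()` does that without touching the data. A sketch on a hypothetical toy series (`views`) that contrasts the two methods:

```python
import pandas as pd

# Hypothetical toy series with one level that never occurs
views = pd.Series(pd.Categorical(
    ['LIBERAL', 'MODERATE', 'LIBERAL'],
    categories=['LIBERAL', 'MODERATE', 'very very liberal']))

# Removing a level that has rows turns those rows into missing values
no_moderate = views.cat.remove_categories(['MODERATE'])
print(no_moderate.isna().sum())

# Dropping only the levels with zero rows leaves the data untouched
tidy = views.cat.remove_unused_categories()
print(list(tidy.cat.categories))
```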


## Wrangling Categorical Variables

**Exercise:** Remove the imaginary degree you created (`degree`).

In [35]:
## Your code here

# Great work!