# Involution in Chinese Society

Author: Rudan Zheng (rudanooo.z@gmail.com)

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

I would like to study "the involution of Chinese society", which refers to the self-perpetuating cycle of increasing competition, pressure, and expectations in Chinese society that leads to overwork, exhaustion, and diminishing returns. I would like to model this situation as a Prisoner's Dilemma, so I first determine the relationship between education level and job income. Before fitting the regression model, I compare each person's actual job income to the average job income for the same education level and use the comparison to label their behavior (overworked or not). With these labels, I fit a logistic regression model to analyze the relationship between education level, job income, and overworked behavior.

## Screen Data

In this section, I filtered the available data by different conditions.

In [1]:
pip install altair==5.0.0

In [2]:
import pandas as pd
import altair as alt

In [3]:
alt.__version__

'5.0.0'

In [4]:
df = pd.read_csv("cfps2020edu_income.csv")
df

Unnamed: 0,age,gender,provcd20,provcd GDP/person rank,edu status,urban,job income
0,51,0,11,1.0,-8,1,63000
1,54,1,11,1.0,-8,1,95000
2,31,1,12,6.0,7,1,100000
3,23,0,13,26.0,4,1,-8
4,30,0,13,26.0,7,1,50000
...,...,...,...,...,...,...,...
28525,61,1,62,31.0,-8,0,-8
28526,48,1,65,18.0,-8,1,-8
28527,47,0,65,18.0,-8,1,-8
28528,47,0,35,4.0,#NULL!,-9,-8


In [5]:
df = df[df["job income"] > 1000]
df

Unnamed: 0,age,gender,provcd20,provcd GDP/person rank,edu status,urban,job income
0,51,0,11,1.0,-8,1,63000
1,54,1,11,1.0,-8,1,95000
2,31,1,12,6.0,7,1,100000
4,30,0,13,26.0,7,1,50000
5,32,1,13,26.0,5,1,8500
...,...,...,...,...,...,...,...
28519,41,0,32,3.0,4,1,40000
28521,34,0,12,6.0,7,1,78000
28522,49,0,62,31.0,-8,1,25000
28524,25,0,35,4.0,5,1,50000


In [6]:
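# Keep rows whose education code contains a digit from 5 to 9
df = df[df["edu status"].str.contains('[5-9]', na=False)]
# The pattern also matches the "8" in "-8", so drop that code explicitly
df = df[df["edu status"] != "-8"]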
df

Unnamed: 0,age,gender,provcd20,provcd GDP/person rank,edu status,urban,job income
2,31,1,12,6.0,7,1,100000
4,30,0,13,26.0,7,1,50000
5,32,1,13,26.0,5,1,8500
11,34,0,21,19.0,7,1,100000
12,33,0,21,19.0,7,1,30000
...,...,...,...,...,...,...,...
28510,39,1,13,26.0,6,1,30000
28516,33,0,22,27.0,5,1,16000
28521,34,0,12,6.0,7,1,78000
28524,25,0,35,4.0,5,1,50000


I drop "-8" because it is meaningless data: if a respondent does not want to give the information, the survey result shows -8.
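As a hedged illustration of this filtering step, the sketch below uses a made-up toy column (the values are not from the CFPS file). A regex character class matches single characters, so it also matches the "8" inside "-8", which is why the second filter is needed. An exact membership test with `isin` is one alternative that keeps only the intended codes in a single step; the names `toy` and `valid_codes` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not from the CFPS file)
toy = pd.DataFrame({"edu status": ["5", "7", "-8", "#NULL!", "3", "9"]})

# A character class matches any single listed character, so "-8" passes too
regex_mask = toy["edu status"].str.contains("[5-9]", na=False)

# Exact membership test: only the five intended codes pass
valid_codes = ["5", "6", "7", "8", "9"]
isin_mask = toy["edu status"].isin(valid_codes)

regex_kept = toy.loc[regex_mask, "edu status"].tolist()
isin_kept = toy.loc[isin_mask, "edu status"].tolist()
print(regex_kept)
print(isin_kept)
```

On the toy column, the regex filter keeps "-8" while the `isin` filter does not.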

In [7]:
# .copy() makes df an independent DataFrame, so columns added later are not set on a slice
df = df[df["urban"] != -9].copy()
df

Unnamed: 0,age,gender,provcd20,provcd GDP/person rank,edu status,urban,job income
2,31,1,12,6.0,7,1,100000
4,30,0,13,26.0,7,1,50000
5,32,1,13,26.0,5,1,8500
11,34,0,21,19.0,7,1,100000
12,33,0,21,19.0,7,1,30000
...,...,...,...,...,...,...,...
28507,32,1,43,15.0,5,0,70000
28510,39,1,13,26.0,6,1,30000
28516,33,0,22,27.0,5,1,16000
28521,34,0,12,6.0,7,1,78000


Similarly, "-9" is meaningless data for "urban", so I drop it too.

In [8]:
def classify_rank(rank):
    if 1 <= rank < 11:
        return '0'
    elif 11 <= rank < 21:
        return '1'
    else: 
        return '2'

In [9]:
df["Classification"] = df["provcd GDP/person rank"].apply(classify_rank)


In the previous steps, I screened the data on several parameters and kept only the meaningful rows. Here is an explanation of each parameter:
1. Edu status: 5: high school/junior high school/technical school/vocational high school, 6: community college, 7: undergraduate, 8: master, 9: Ph.D.
2. Gender: 0: female, 1: male
3. provcd20: the province code
4. provcd GDP/person rank: the rank of the province by GDP per person, which indicates people's quality of life. 1 means the highest GDP per person and 31 means the lowest.
5. Classification: I classify the provcd GDP/person rank into three categories ("high", "middle", and "low"). In the data set, 0: high, 1: middle, 2: low
6. Urban: 1: urban, 0: rural area
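The three-way classification of the rank can also be sketched with `pd.cut`, shown here on a few hypothetical rank values. This is an alternative formulation, not the code used in the notebook, and it assumes ranks run from 1 to 31. One difference to keep in mind: `pd.cut` leaves out-of-range or missing ranks as missing, whereas the `else` branch of `classify_rank` assigns them to '2'.

```python
import pandas as pd

# Hypothetical rank values at the edges of each category
ranks = pd.Series([1.0, 10.0, 11.0, 20.0, 21.0, 31.0])

# Right-closed bins: (0, 10] -> "0", (10, 20] -> "1", (20, 31] -> "2"
classes = pd.cut(ranks, bins=[0, 10, 20, 31], labels=["0", "1", "2"])

result = classes.astype(str).tolist()
print(result)
```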

In [10]:
df['job income_edu'] = df['edu status'].map(df.groupby('edu status')['job income'].mean())


In [11]:
df["overwork"] = (df["job income"] < df["job income_edu"]).astype(int)


I add a new column "job income_edu" to df, which contains the average "job income" for each "edu status". The column df["overwork"] encodes the following rule: if an observation's "job income" is not below the mean job income for his or her "edu status" (the "job income_edu" value), then he or she will not overwork to reach a higher education status and a higher income. Otherwise, he or she will continue to work and improve his or her edu status, which means overwork.

Overwork: 1 means true (overwork), 0 means false (not overwork).
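A minimal sketch of this labeling rule on made-up numbers (the toy incomes are hypothetical): `groupby(...).transform("mean")` is one way to attach the group mean to every row, equivalent in effect to the `map` call used above, and the comparison then yields the 0/1 "overwork" flag.

```python
import pandas as pd

# Toy data (hypothetical incomes, not from the CFPS file)
toy = pd.DataFrame({
    "edu status": ["5", "5", "7", "7"],
    "job income": [20000, 40000, 50000, 90000],
})

# Mean income within each education level, aligned back to every row
toy["job income_edu"] = toy.groupby("edu status")["job income"].transform("mean")

# 1 = income below the group mean ("overwork"), 0 = otherwise
toy["overwork"] = (toy["job income"] < toy["job income_edu"]).astype(int)
print(toy)
```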

## Data Visualizations

In [12]:
edu_order = ["5","6","7","8","9"]
c1 = alt.Chart(df, width=400, height=800).mark_point(size=100).encode(
    x=alt.X("edu status:O", sort=edu_order, title="edu status"), 
    y=alt.Y("job income:Q",title="job income"),
    color=alt.Color("Classification:O", scale=alt.Scale(scheme="tableau10")),
    tooltip=["edu status:O","job income:Q","Classification:O"]
)

c1

In [13]:
df_sub = df[df["job income"] <= 20000].copy()
c2 = alt.Chart(df_sub, width=400, height=800).mark_point(size=100).encode(
    x=alt.X("edu status:O", sort=edu_order, title="edu status"), 
    y=alt.Y("job income:Q",title="job income"),
    color=alt.Color("Classification:O", scale=alt.Scale(scheme="dark2")),
    tooltip=["edu status:O","job income:Q","Classification:O"]
)

c2

c1 shows the distribution of "job income" and "edu status" across the entire dataset, color-coded by "Classification". However, it is difficult to observe the details at the lower end of the "job income" range in c1, so c2 narrows the data to observations with "job income" of at most 20,000 (the filter used in the code above). As can be seen in c2, "edu status" and the corresponding "job income" seem to be lower in provinces with lower GDP per capita rankings, i.e., "Classification = 2".
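A visual impression like this can be cross-checked with a small summary table. The sketch below uses made-up toy values, so it only illustrates the mechanics and does not confirm or refute the observation; the name `summary` is hypothetical. It computes the median "job income" for each combination of "edu status" and "Classification".

```python
import pandas as pd

# Toy data (hypothetical values, not from the CFPS file)
toy = pd.DataFrame({
    "edu status": ["5", "5", "7", "7"],
    "Classification": ["0", "2", "0", "2"],
    "job income": [40000, 30000, 90000, 60000],
})

# Rows: education level; columns: GDP-per-person category; cells: median income
summary = toy.pivot_table(
    index="edu status",
    columns="Classification",
    values="job income",
    aggfunc="median",
)
print(summary)
```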

## Logistic Regression

I want to figure out the relationship between "edu status", "Classification", and "overwork". To be more specific, "edu status" and "Classification" are the explanatory variables and "overwork" is the response variable in my model. Because the response variable is binary and the explanatory variables are categorical, I will use a logistic regression model.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [15]:
cols = ["edu status","Classification"]
df_dum = df.copy()
df_dum = pd.get_dummies(df_dum, columns=cols, drop_first=True)
df_dum

Unnamed: 0,age,gender,provcd20,provcd GDP/person rank,urban,job income,job income_edu,overwork,edu status_6,edu status_7,edu status_8,edu status_9,Classification_1,Classification_2
2,31,1,12,6.0,1,100000,76244.589792,0,0,1,0,0,0,0
4,30,0,13,26.0,1,50000,76244.589792,1,0,1,0,0,0,1
5,32,1,13,26.0,1,8500,45919.575921,1,0,0,0,0,0,1
11,34,0,21,19.0,1,100000,76244.589792,0,0,1,0,0,1,0
12,33,0,21,19.0,1,30000,76244.589792,1,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28507,32,1,43,15.0,0,70000,45919.575921,0,0,0,0,0,1,0
28510,39,1,13,26.0,1,30000,52510.332378,1,1,0,0,0,0,1
28516,33,0,22,27.0,1,16000,45919.575921,1,0,0,0,0,0,1
28521,34,0,12,6.0,1,78000,76244.589792,0,0,1,0,0,0,0


Since my explanatory variables are categorical, I need to create dummy variables for them to implement logistic regression.

In [16]:
df_dum.columns

Index(['age', 'gender', 'provcd20', 'provcd GDP/person rank', 'urban',
       'job income', 'job income_edu', 'overwork', 'edu status_6',
       'edu status_7', 'edu status_8', 'edu status_9', 'Classification_1',
       'Classification_2'],
      dtype='object')

In [17]:
X = df_dum.loc[:, ['edu status_6', 'edu status_7', 'edu status_8',
       'edu status_9', 'Classification_1', 'Classification_2']]

In [18]:
X

Unnamed: 0,edu status_6,edu status_7,edu status_8,edu status_9,Classification_1,Classification_2
2,0,1,0,0,0,0
4,0,1,0,0,0,1
5,0,0,0,0,0,1
11,0,1,0,0,1,0
12,0,1,0,0,1,0
...,...,...,...,...,...,...
28507,0,0,0,0,1,0
28510,1,0,0,0,0,1
28516,0,0,0,0,0,1
28521,0,1,0,0,0,0


In this part, "edu status_5" and "Classification_0" are the base groups (dropped by drop_first=True), so they do not appear in X.
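The effect of `drop_first=True` can be seen on a tiny made-up frame (hypothetical values): the first level of each categorical column gets no dummy column of its own, so a row belonging to both base groups is encoded as all zeros.

```python
import pandas as pd

# Toy data (hypothetical values): the first row belongs to both base groups
toy = pd.DataFrame({
    "edu status": ["5", "6", "7"],
    "Classification": ["0", "1", "2"],
})

dummies = pd.get_dummies(toy, columns=["edu status", "Classification"], drop_first=True)

cols = dummies.columns.tolist()
base_row_sum = int(dummies.iloc[0].astype(int).sum())
print(cols)
print(base_row_sum)
```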

In [19]:
y = df_dum["overwork"]

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [21]:
clf = LogisticRegression()

In [22]:
clf.fit(X_train, y_train)

In [23]:
clf.score(X_train,y_train)

0.669386002576213

In [24]:
clf.score(X_test,y_test)

0.6886886886886887

The difference between clf.score on the test set and on the training set is about 2 percentage points (less than 5%). Thus, overfitting might not be a serious concern.

In [25]:
y_pred = clf.predict(X_test)

In [26]:
clf_mse = mean_squared_error(y_test, y_pred)
clf_mse

0.3113113113113113

Because the labels are 0 and 1, the mean squared error of my model's predictions on the test data (about 0.311) is the misclassification rate, i.e. one minus the test accuracy. It is not very large, indicating that the predictions work but still need improvement. Additionally, the model's clf.score on the test set is about 0.689, meaning that the model classifies about 68.9% of the test observations correctly. I think I can drop some extreme outliers to improve my model. Besides that, I can also add interaction variables between "edu status" and "Classification", since a higher quality of life is associated with a higher probability of receiving higher education.
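For 0/1 labels, each squared error is 1 for a wrong prediction and 0 for a correct one, so the mean squared error equals one minus the accuracy. The sketch below checks this on hypothetical label vectors (not the notebook's data). It is consistent with the values reported above, where 1 - 0.6887 ≈ 0.3113.

```python
import numpy as np

# Hypothetical true labels and predictions (two mismatches out of eight)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_true == y_hat)   # share of correct predictions
mse = np.mean((y_true - y_hat) ** 2)  # share of wrong predictions for 0/1 labels

print(accuracy, mse)
```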

## Decision Tree

In [27]:
from sklearn import tree
dec = tree.DecisionTreeClassifier()
dec = dec.fit(X,y)

In [28]:
!pip install graphviz

In [29]:
import graphviz

In [30]:
dot_data = tree.export_graphviz(dec, feature_names=X.columns, class_names=["Not Overwork", "Overwork"], filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("Involution in China")

'Involution in China.pdf'

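This part of the process generates a decision tree that predicts whether people are overworked from the "edu status" and "Classification" dummy variables in X. The "overwork" label itself is defined from the "job income" relative to the average for the corresponding "edu status".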
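When Graphviz is not available, the fitted rules can also be inspected as plain text with scikit-learn's `tree.export_text`. The sketch below is a hedged illustration on a tiny made-up dataset in which the label depends only on one dummy column; the names `X_toy`, `y_toy`, and `toy_tree` are hypothetical and the data is not from the notebook.

```python
import pandas as pd
from sklearn import tree

# Toy features (hypothetical): two of the dummy columns used in X
X_toy = pd.DataFrame({
    "edu status_7": [0, 0, 1, 1, 0, 1],
    "Classification_2": [0, 1, 0, 1, 0, 1],
})
# Toy labels: 1 exactly when "edu status_7" is 0
y_toy = [1, 1, 0, 0, 1, 0]

toy_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Text rendering of the learned splits, one line per node
rules = tree.export_text(toy_tree, feature_names=list(X_toy.columns))
print(rules)
```

Limiting `max_depth` keeps the printed rules short; the notebook's tree is fitted without a depth limit.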

## Summary

First, I used pandas to weed out the meaningless data in preparation for the logistic regression analysis. During the data processing I added my own definitions, for example: if "job income" is higher than the average for that "edu status", then "overwork" is false, and vice versa. In the logistic regression, I used dummy variables for "edu status" and "Classification" as explanatory variables and "overwork" as the response variable. Finally, I drew a decision tree based on the same variables.

## References

* What is the source of your dataset(s)?

My dataset comes from the China Family Panel Studies (CFPS), a biennial survey of communities, families, and individuals in China sponsored by the Institute for Social Science Surveys (ISSS) at Peking University. The survey aims to collect longitudinal data at the individual, household, and community levels in contemporary China. The study focuses on the economic and non-economic well-being of the Chinese population and contains a wealth of information on topics such as economic activity and educational outcomes. My data comes from the 2020 Individual Survey. The dataset website is https://www.isss.pku.edu.cn/cfps/en/ and you can request an account to view the full data.

* List any other references that you found helpful.

Dummy variables: https://www.geeksforgeeks.org/how-to-create-dummy-variables-in-python-with-pandas/

Decision tree reference: https://scikit-learn.org/stable/modules/tree.html
