One of the most common ways to work with categorical variables is 0/1 (dummy) encoding.  In this technique, you create a new column for every level of the categorical variable.  The **advantages** of this approach include:

    1. Each level can have its own influence on the response.
    2. No rank is imposed on the categories.
    3. The results are easier to interpret than with other encodings.
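
As a minimal sketch of what 0/1 encoding looks like, the cell below encodes a small toy frame with `pd.get_dummies`. The toy data and the `YearsCoding` column are made up for illustration and are not taken from the survey file.

```python
import pandas as pd

# Toy data (hypothetical values, not from the survey file)
toy = pd.DataFrame({
    "Country": ["United States", "India", "Canada", "India"],
    "YearsCoding": [3, 10, 5, 1],
})

# Create one 0/1 column per level of Country.
# dtype=int stores the indicators as 0/1 rather than booleans.
encoded = pd.get_dummies(toy, columns=["Country"], prefix="Country", dtype=int)
print(encoded)
```

Each row has a 1 in exactly one of the `Country_*` columns, so no ordering of the countries is implied.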
    
The **disadvantage** of this approach is that it introduces a large number of effects into your model.  If you have many categorical variables, or categorical variables with many levels, but not a large sample size, you might not be able to estimate the impact of each of these variables on your response variable.  One rule of thumb suggests at least 10 data points for each variable you add to your model, that is, 10 rows for each column.  This is a reasonable lower bound, but the larger your sample (assuming it is representative), the better.
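
One rough way to apply the 10-rows-per-column rule of thumb is to count how many dummy columns an encoding would create and compare that to the number of rows. The helper names `count_dummy_columns` and `meets_rows_per_column` below are hypothetical, and the toy frame is made up for illustration.

```python
import pandas as pd

def count_dummy_columns(frame, cat_cols):
    """Count the 0/1 columns created if every level of each column gets its own column."""
    return int(sum(frame[col].nunique(dropna=True) for col in cat_cols))

def meets_rows_per_column(frame, n_columns, rows_per_column=10):
    """Check the rule of thumb: at least rows_per_column rows for each model column."""
    return len(frame) >= rows_per_column * n_columns

# Toy data (hypothetical values, not from the survey file)
toy = pd.DataFrame({
    "Country": ["US", "India", "Canada", "India", "US", "US"],
    "Professional": ["Student", "Developer", "Developer", "Student", "Developer", "Student"],
})

n_dummies = count_dummy_columns(toy, ["Country", "Professional"])  # 3 levels + 2 levels
ok = meets_rows_per_column(toy, n_dummies)  # 6 rows against 10 * 5 required
print(n_dummies, ok)
```

This only counts the dummy columns; any quantitative columns in the model would add to the total.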

Let's try adding dummy variables for the categorical variables to the model.  We will then compare the result against the original model, which used only quantitative variables, to see how much it improves.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline

In [4]:
df = pd.read_csv('./survey_results_public.csv')

In [12]:
cat_df = df.select_dtypes(include = ['object']).copy()

In [16]:
cat_df.shape

(19102, 147)

In [23]:
cat_df.dropna(axis = 1, how = "any")

Unnamed: 0,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation
0,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school
1,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...
2,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree
3,Professional non-developer who sometimes write...,"Yes, both",United States,No,Employed full-time,Doctoral degree
4,Professional developer,"Yes, I program as a hobby",Switzerland,No,Employed full-time,Master's degree
...,...,...,...,...,...,...
19097,Professional developer,"Yes, I program as a hobby",Canada,No,Employed full-time,Bachelor's degree
19098,Student,"Yes, I program as a hobby",India,No,"Not employed, and not looking for work",Secondary school
19099,Professional non-developer who sometimes write...,"Yes, I program as a hobby",United Kingdom,No,"Independent contractor, freelancer, or self-em...",Bachelor's degree
19100,Professional developer,"Yes, I program as a hobby",United States,No,Employed full-time,Some college/university study without earning ...


In [74]:
len(cat_df.columns[cat_df.isnull().mean() >= 0.5])

49

In [99]:
df.isnull().sum()  # number of missing values in each column

Respondent                  0
Professional                0
ProgramHobby                0
Country                     0
University                  0
                        ...  
QuestionsInteresting     6366
QuestionsConfusing       6396
InterestedAnswers        6342
Salary                  14093
ExpectedSalary          18284
Length: 154, dtype: int64