## Map the headers to a column with pandas

Data set: Stack Overflow Developer Survey 2018

* https://insights.stackoverflow.com/survey
* https://insights.stackoverflow.com/survey/2018#technology

Topics

* map headers to a new column based on the values in each row

Bonus

* pandas dot method - matrix multiplication
* understand np.where
* map single column of dataframe
* map all columns of a dataframe
* map and NaN
* check all distinct values in dataframe
* Optimize big data frames:
    * Columns have mixed types. Specify dtype option on import or set low_memory=False.
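
The mixed-types advice in the last bullet can be sketched on a tiny in-memory CSV. This is only an illustration: the column `MixedCol` and the sample rows are made up, and the dtypes chosen here are assumptions rather than the right choice for the real survey file.

```python
import io
import pandas as pd

# Hypothetical sample: MixedCol holds both numbers and text
csv_text = "Respondent,Hobby,MixedCol\n1,Yes,1000\n3,No,unknown\n"

# Declare the dtypes up front so pandas does not have to guess them
df_small = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Respondent": "int64", "Hobby": "category", "MixedCol": str},
)
print(df_small.dtypes)
```

Passing `low_memory=False`, as in the cell below, avoids the warning without declaring types, at the cost of reading the whole file before inferring them.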


In [1]:
import pandas as pd
# None means "no limit"; the older value -1 is deprecated and rejected by recent pandas
pd.set_option('display.max_colwidth', None)

In [2]:
# read the data frame and see the data insight
df = pd.read_csv("../csv/stackoverflow/developer_survey_2018/survey_results_public.csv", low_memory=False)
print(df.shape)

(98855, 129)


In [3]:
# examples
df.head()

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, physics)","10,000 or more employees",Database administrator;DevOps specialist;Full-stack developer;System administrator,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
2,4,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or software engineering",20 to 99 employees,Engineering manager;Full-stack developer,...,,,,,,,,,,
3,5,No,No,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or software engineering",100 to 499 employees,Full-stack developer,...,I don't typically exercise,Male,Straight or heterosexual,Some college/university study without earning a degree,White or of European descent,35 - 44 years old,No,No,The survey was an appropriate length,Somewhat easy
4,7,Yes,No,South Africa,"Yes, part-time",Employed full-time,Some college/university study without earning a degree,"Computer science, computer engineering, or software engineering","10,000 or more employees",Data or business analyst;Desktop or enterprise applications developer;Game or graphics developer;QA or test developer;Student,...,3 - 4 times per week,Male,Straight or heterosexual,Some college/university study without earning a degree,White or of European descent,18 - 24 years old,Yes,,The survey was an appropriate length,Somewhat easy


In [4]:
# create new data frame with 3 columns
columns = ['Hobby', 'OpenSource', 'Student']
df_answers = df[columns]
df_answers.head()

Unnamed: 0,Hobby,OpenSource,Student
0,Yes,No,No
1,Yes,Yes,No
2,Yes,Yes,No
3,No,No,No
4,Yes,No,"Yes, part-time"


In [5]:
# map single column of dataframe
df_answers.Student.map( {'Yes':1, 'No':0}).head()

0    0.0
1    0.0
2    0.0
3    0.0
4    NaN
Name: Student, dtype: float64

In [6]:
# check all distinct values in dataframe
df_answers.Student.unique()

array(['No', 'Yes, part-time', nan, 'Yes, full-time'], dtype=object)
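
Why the NaN appears: `map` with a dict returns NaN for every value that has no key in the dict. A minimal sketch on a made-up Series holding the same four distinct values (the names `student`, `mapped` and `filled` are only for illustration):

```python
import pandas as pd

# Same distinct values as the Student column, in a small hand-made Series
student = pd.Series(["No", "Yes, part-time", None, "Yes, full-time"])

# Only 'No' has a key in the dict, everything else becomes NaN
mapped = student.map({"Yes": 1, "No": 0})
print(mapped)

# One possible fix: fill the gaps after mapping and go back to integers
filled = mapped.fillna(0).astype(int)
print(filled)
```

The NaN values also explain the `float64` dtype of the result: an integer column cannot hold NaN, so pandas falls back to float.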

In [7]:
# map all columns of a dataframe
import numpy as np
# np.nan instead of np.NaN: the np.NaN alias was removed in NumPy 2.0
new_values = {'Yes':1, 'No':0, 'Yes, part-time':0, 'Yes, full-time':0, np.nan:0}

df_answers = df_answers.apply(lambda x: x.map( new_values ))

df_answers.head()

Unnamed: 0,Hobby,OpenSource,Student
0,1,0,0
1,1,1,0
2,1,1,0
3,0,0,0
4,1,0,0
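
Since only the exact answer 'Yes' ends up as 1 in the mapping above, the same 0/1 frame can probably be built with a comparison instead of a dict. A sketch on a small made-up frame (the variable names and sample rows are assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

answers = pd.DataFrame({
    "Hobby": ["Yes", "No", "Yes"],
    "OpenSource": ["No", "Yes", "No"],
    "Student": ["Yes, part-time", np.nan, "No"],
})

# dict-based version: unmapped values and NaN are filled with 0 afterwards
via_map = answers.apply(lambda col: col.map({"Yes": 1, "No": 0})).fillna(0).astype(int)

# comparison-based version: True only where the cell is exactly 'Yes'
via_eq = answers.eq("Yes").astype(int)

print(via_map.equals(via_eq))
```

The comparison is shorter, but the dict is the better fit as soon as different answers need different numbers.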


In [8]:
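# map headers to columns way 1: np.where keeps the column name where the value is 1
# and an empty string otherwise, then sum(axis=1) concatenates the strings per row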
df_answers['answer'] = np.where(df_answers, df_answers.columns, '').sum(axis=1)
df_answers.head()

Unnamed: 0,Hobby,OpenSource,Student,answer
0,1,0,0,Hobby
1,1,1,0,HobbyOpenSource
2,1,1,0,HobbyOpenSource
3,0,0,0,
4,1,0,0,Hobby


In [9]:
# np.where with only a condition returns the indices where it is True
a = np.array([1, 2, 3, 4, 9, 7, 8, 6])
np.where(a > 6)

(array([4, 5, 6]),)

In [10]:
a = np.array([1, 2,3, 4, 9, 7, 8, 6])
a[np.where(a > 6)]

array([9, 7, 8])

In [11]:
# np.where(condition, x, y) takes the element from x where condition is True, else from y
np.where([[True, False], [True, True]],
          [[1, 2], [3, 4]],
          [[9, 8], [7, 6]])

array([[1, 8],
       [3, 4]])

In [12]:
np.where([[False, False], [False, False]],
          [[1, 2], [3, 4]],
          [[9, 8], [7, 6]])

array([[9, 8],
       [7, 6]])

In [13]:
np.where([[True, True], [True, True]],
          [[1, 2], [3, 4]],
          [[9, 8], [7, 6]])

array([[1, 2],
       [3, 4]])
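
Putting the three-argument form together with "way 1": the 0/1 frame acts as the condition, the column labels are broadcast across every row as `x`, and the empty string is `y`. A step-by-step sketch on a small made-up frame (the name `flags` is hypothetical, and `"".join` is used here in place of `.sum(axis=1)` to make the concatenation explicit):

```python
import numpy as np
import pandas as pd

flags = pd.DataFrame({
    "Hobby": [1, 1, 0],
    "OpenSource": [0, 1, 0],
    "Student": [0, 0, 0],
})

# 1 -> take the column label, 0 -> take the empty string
picked = np.where(flags.to_numpy().astype(bool), flags.columns, "")
print(picked)

# concatenate the strings of each row
answer = ["".join(row) for row in picked]
print(answer)
```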

In [14]:
df_answers.drop('answer', axis=1, inplace=True)
df_answers.head()

Unnamed: 0,Hobby,OpenSource,Student
0,1,0,0
1,1,1,0
2,1,1,0
3,0,0,0
4,1,0,0


In [15]:
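# map headers to columns way 2: dot multiplies each 0/1 value by its column name
# (1 * 'Hobby' == 'Hobby', 0 * 'Hobby' == '') and adds the results per row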
df_answers.assign(answer=df_answers.dot(df_answers.columns)).head()

Unnamed: 0,Hobby,OpenSource,Student,answer
0,1,0,0,Hobby
1,1,1,0,HobbyOpenSource
2,1,1,0,HobbyOpenSource
3,0,0,0,
4,1,0,0,Hobby


In [16]:
# multiplying by the identity matrix leaves a unchanged
a = pd.DataFrame([[1, 2],
                  [4, 5]])
b = pd.DataFrame([[1, 0],
                  [0, 1]])
a.dot(b)

Unnamed: 0,0,1
0,1,2
1,4,5


In [17]:
a = pd.DataFrame([[1, 2], 
                  [4, 5]])
b = pd.DataFrame([[2, 0], 
                  [0, 0]])
a.dot(b)

Unnamed: 0,0,1
0,2,0
1,8,0
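
The same row-times-column rule is what makes "way 2" work with strings: each 0/1 value is multiplied by a column label and the products are added, which for strings means repetition and concatenation. A sketch of a variant that also puts a separator between the labels, on a small made-up frame (`flags` and `joined` are hypothetical names, and the comma separator is an assumption):

```python
import pandas as pd

flags = pd.DataFrame({
    "Hobby": [1, 1, 0],
    "OpenSource": [0, 1, 0],
    "Student": [0, 0, 0],
})

# append a comma to every label, take the dot product, drop the trailing comma
joined = flags.dot(flags.columns + ",").str.rstrip(",")
print(joined)
```

Without the separator, labels run together as in `HobbyOpenSource`, which is hard to split again later.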
