## Pandas count values in a column of type list?

Data set: Stack Overflow 2018 insights

* https://insights.stackoverflow.com/survey
* https://insights.stackoverflow.com/survey/2018#technology

Topics

* expand list column
* value_counts for list column

Bonus

* combine head and tail
* slicing iloc with range
* value_counts on all columns
* sum per column
* do a sum of several columns
* sum all columns with iteration
* be careful when you chain operations with pandas

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None)

In [2]:
# read the survey results into a DataFrame and check its shape
df = pd.read_csv("../csv/stackoverflow/developer_survey_2018/survey_results_public.csv", low_memory=False)
print(df.shape)

(98855, 129)


In [3]:
df.head(2)

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, physics)","10,000 or more employees",Database administrator;DevOps specialist;Full-stack developer;System administrator,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy


In [4]:
# combine head and tail variant 1
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
rows = 2
pd.concat([df.head(rows), df.tail(rows)])

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, physics)","10,000 or more employees",Database administrator;DevOps specialist;Full-stack developer;System administrator,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
98853,101544,Yes,No,Russian Federation,No,"Independent contractor, freelancer, or self-employed",Some college/university study without earning a degree,,,,...,,,,,,,,,,
98854,101548,Yes,Yes,Cambodia,,,,,,,...,,,,,,,,,,


In [5]:
# combine head and tail variant 2
# ranges with iloc
rows = 2
df.iloc[np.r_[:rows, -rows:0]]

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, physics)","10,000 or more employees",Database administrator;DevOps specialist;Full-stack developer;System administrator,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
98853,101544,Yes,No,Russian Federation,No,"Independent contractor, freelancer, or self-employed",Some college/university study without earning a degree,,,,...,,,,,,,,,,
98854,101548,Yes,Yes,Cambodia,,,,,,,...,,,,,,,,,,
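
The `np.r_` expression used with `iloc` above can be easier to follow on a small example. The sketch below uses a hypothetical ten-row frame named `toy` (not part of the survey data) to show which positions the two slices expand to.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data.
toy = pd.DataFrame({"value": range(10)})

rows = 2
# np.r_ concatenates both slices into one integer array: [0, 1, -2, -1]
positions = np.r_[:rows, -rows:0]

# iloc accepts the array directly; negative positions count from the end.
head_tail = toy.iloc[positions]

print(positions.tolist())
print(head_tail["value"].tolist())
```

The second slice has to stop at `0` rather than be left open, because `np.r_` needs explicit bounds to build the range of negative positions.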


In [6]:
# get examples from column LanguageWorkedWith
rows = 5
df.LanguageWorkedWith.iloc[np.r_[:rows, -rows:0]]

0                              JavaScript;Python;HTML;CSS
1                            JavaScript;Python;Bash/Shell
2                                                     NaN
3        C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell
4                      C;C++;Java;Matlab;R;SQL;Bash/Shell
98850                                                 NaN
98851                                                 NaN
98852                                                 NaN
98853                                                 NaN
98854                                                 NaN
Name: LanguageWorkedWith, dtype: object

In [7]:
# value counts for the same column
df.LanguageWorkedWith.value_counts().iloc[np.r_[:rows, -rows:0]]

C#;JavaScript;SQL;HTML;CSS                                                                            1347
JavaScript;PHP;SQL;HTML;CSS                                                                           1235
Java                                                                                                  1030
JavaScript;HTML;CSS                                                                                    881
C#;JavaScript;SQL;TypeScript;HTML;CSS                                                                  828
C;C++;C#;Java;Python;SQL;Swift;HTML;CSS;Bash/Shell                                                       1
C;C#;Java;JavaScript;PHP;Python;SQL;VBA;VB.NET;HTML;CSS;Bash/Shell                                       1
C#;Objective-C;PHP;Python;Swift;HTML;CSS;Bash/Shell                                                      1
C#;Java;JavaScript;Objective-C;Perl;PHP;Python;SQL;Swift;TypeScript;VBA;VB.NET;HTML;CSS;Bash/Shell       1
C#;CoffeeScript;F#;JavaScript;SQL;Typ

In [8]:
# expand the column on separator
df_lang = df.LanguageWorkedWith.str.split(';', expand=True)
df_lang.shape

(98855, 38)

In [9]:
df_lang = df_lang.dropna(how='all')
df_lang.shape

(78334, 38)

In [10]:
df_lang.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,36,37
0,JavaScript,Python,HTML,CSS,,,,,,,...,,,,,,,,,,
1,JavaScript,Python,Bash/Shell,,,,,,,,...,,,,,,,,,,
3,C#,JavaScript,SQL,TypeScript,HTML,CSS,Bash/Shell,,,,...,,,,,,,,,,
4,C,C++,Java,Matlab,R,SQL,Bash/Shell,,,,...,,,,,,,,,,
5,Java,JavaScript,Python,TypeScript,HTML,CSS,,,,,...,,,,,,,,,,
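
As an alternative to expanding into 38 columns, the split lists can be turned into one long Series with `Series.explode` (available since pandas 0.25) and counted directly. This is a different technique from the one used in this notebook, shown here on a hypothetical four-row miniature of the column.

```python
import pandas as pd

# Hypothetical miniature of the LanguageWorkedWith column.
langs = pd.Series(
    ["JavaScript;Python;HTML", "Python;SQL", None, "JavaScript;SQL;Python"],
    name="LanguageWorkedWith",
)

# Split each cell into a list, then give every list element its own row.
exploded = langs.dropna().str.split(";").explode()

# Count how many respondents mentioned each language.
counts = exploded.value_counts()

# Share of answering respondents that mentioned each language.
share = counts / langs.notna().sum()

print(counts)
print(share)
```

Because every language ends up in the same Series, there is no placeholder value to fill in and no per-column totals to add up afterwards.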


In [11]:
# count how often each language appears in each column
# apply value_counts to every column; fillna(0) turns empty cells into a placeholder label 0
df_lang_num = df_lang.fillna(0).apply(pd.Series.value_counts)
df_lang_num.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,36,37
Assembly,5760.0,,,,,,,,,,...,,,,,,,,,,
Bash/Shell,29.0,465.0,1221.0,1929.0,2882.0,4442.0,4844.0,4269.0,3311.0,2562.0,...,3.0,1.0,2.0,2.0,,1.0,,,2.0,35.0
C,13335.0,4707.0,,,,,,,,,...,,,,,,,,,,
C#,16969.0,4321.0,3990.0,1674.0,,,,,,,...,,,,,,,,,,
C++,7042.0,9275.0,3555.0,,,,,,,,...,,,,,,,,,,


In [12]:
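# get languages as percentage / ratio
# a lambda lets us pass normalize=True to value_counts
# (the top-level pd.value_counts is deprecated in recent pandas, so the Series method is used)
df_lang_per = df_lang.fillna(0).apply(lambda x: x.value_counts(normalize=True))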
df_lang_per.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,36,37
Assembly,0.073531,,,,,,,,,,...,,,,,,,,,,
Bash/Shell,0.00037,0.005936,0.015587,0.024625,0.036791,0.056706,0.061838,0.054497,0.042268,0.032706,...,3.8e-05,1.3e-05,2.6e-05,2.6e-05,,1.3e-05,,,2.6e-05,0.000447
C,0.170233,0.060089,,,,,,,,,...,,,,,,,,,,
C#,0.216624,0.055161,0.050936,0.02137,,,,,,,...,,,,,,,,,,
C++,0.089897,0.118403,0.045383,,,,,,,,...,,,,,,,,,,


In [13]:
# why a lambda is needed to pass parameters: this line calls value_counts
# immediately, without a Series, instead of handing the function to apply
df_lang_per = df_lang.fillna(0).apply(pd.Series.value_counts(normalize=True))

TypeError: value_counts() missing 1 required positional argument: 'self'
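
Besides a lambda, `DataFrame.apply` also forwards extra keyword arguments to the function it is given, so the function object can be passed uncalled together with `normalize=True`. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical toy frame with two expanded language columns.
toy = pd.DataFrame(
    {0: ["C", "Java", "C", "C"], 1: ["Java", "C", "Java", "Java"]}
)

# Pass the function itself (no parentheses); apply forwards normalize=True to it.
ratios = toy.apply(pd.Series.value_counts, normalize=True)

# The lambda form gives the same numbers.
ratios_lambda = toy.apply(lambda col: col.value_counts(normalize=True))

print(ratios)
```

Both forms avoid the `TypeError` because `value_counts` is only called once `apply` supplies each column as the Series.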

In [14]:
# share of respondents using each language, summed across all columns
# (the row labelled 0 is the fillna placeholder, not a language)
df_lang_per['total'] = df_lang_per.sum(axis=1)
df_lang_per.sort_values('total', ascending=False)['total'].head(10)

0             31.800036
JavaScript     0.698113
HTML           0.684607
CSS            0.650790
SQL            0.570250
Java           0.453456
Bash/Shell     0.397937
Python         0.387558
C#             0.344091
PHP            0.307287
Name: total, dtype: float64

In [15]:
# total number of respondents using each language
# (the row labelled 0 is again the fillna placeholder)
df_lang_num['total'] = df_lang_num.sum(axis=1)
df_lang_num.sort_values('total', ascending=False)['total'].head()

0             2491024.0
JavaScript      54686.0
HTML            53628.0
CSS             50979.0
SQL             44670.0
Name: total, dtype: float64

In [16]:
df_lang.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,36,37
0,JavaScript,Python,HTML,CSS,,,,,,,...,,,,,,,,,,
1,JavaScript,Python,Bash/Shell,,,,,,,,...,,,,,,,,,,
3,C#,JavaScript,SQL,TypeScript,HTML,CSS,Bash/Shell,,,,...,,,,,,,,,,
4,C,C++,Java,Matlab,R,SQL,Bash/Shell,,,,...,,,,,,,,,,
5,Java,JavaScript,Python,TypeScript,HTML,CSS,,,,,...,,,,,,,,,,


In [17]:
# get value counts for first column
df_lang[0].value_counts().head()

C#            16969
C             13335
JavaScript    12150
Java          12087
C++            7042
Name: 0, dtype: int64

In [18]:
# get value counts for second column
df_lang[1].value_counts().head()

JavaScript    19532
Java          10175
C++            9275
PHP            6450
C              4707
Name: 1, dtype: int64

In [19]:
# do a sum of several columns
# note: + aligns on the index, so a label missing from any one column becomes NaN
df_comb_col = (
    df_lang[0].value_counts(dropna=False)
    + df_lang[1].value_counts(dropna=False)
    + df_lang[2].value_counts(dropna=False)
    + df_lang[3].value_counts(dropna=False)
)
df_comb_col.sort_values(ascending=False).head()

JavaScript    48938.0
Java          32991.0
C#            26954.0
SQL           24727.0
Python        19063.0
dtype: float64
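
Adding Series with `+` aligns them on their index, and any label that is missing from one operand comes out as `NaN`. The sketch below, on two hypothetical count Series, contrasts that with `Series.add(..., fill_value=0)`, which treats a missing label as zero.

```python
import pandas as pd

# Hypothetical value counts from two different columns.
first = pd.Series({"C": 3, "Java": 2})
second = pd.Series({"Java": 4, "Python": 1})

# Plain addition: C and Python are each missing on one side, so they become NaN.
plain_sum = first + second

# add() with fill_value=0 keeps every label and treats a missing one as 0.
filled_sum = first.add(second, fill_value=0)

print(plain_sum)
print(filled_sum)
```

This may explain why a language that only ever appears in some of the summed columns can drop out of a total built with `+`.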

In [20]:
df_comb = pd.DataFrame()
lang_index = []
df_lang.fillna(0, inplace=True)

In [21]:
# sum all columns in dataframe with iteration
for col in df_lang.columns:
    if col == 0:
        df_comb['total'] = df_lang[col].fillna(0).value_counts()
        lang_index = df_lang[col].value_counts().index
    else:
        col_ser = df_lang[col].fillna(0).value_counts()
        col_ser = col_ser.reindex(lang_index, fill_value=0)
        df_comb['total'] = df_comb['total'] + col_ser
# careful: head() comes from the sorted frame, but tail() from the unsorted df_comb
pd.concat([df_comb.sort_values('total', ascending=False).head(rows), df_comb.tail(rows)])


Unnamed: 0,total
JavaScript,54686
HTML,53628
CSS,50979
SQL,44670
Java,35521
Rust,1857
Kotlin,3508
Cobol,590
Ocaml,470
CSS,50979


In [22]:
df_comb = df_comb.sort_values('total', ascending=False)
pd.concat([df_comb.head(rows), df_comb.tail(rows)])

Unnamed: 0,total
JavaScript,54686
HTML,53628
CSS,50979
SQL,44670
Java,35521
Erlang,886
Cobol,590
Ocaml,470
Julia,430
Hack,254
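
The two outputs above differ only in where `tail` is called: `sort_values` returns a new frame and leaves the original untouched. A small sketch with a hypothetical `totals` frame makes the difference visible.

```python
import pandas as pd

# Hypothetical totals in arbitrary (unsorted) order.
totals = pd.DataFrame(
    {"total": [5, 50, 1, 20]}, index=["Rust", "CSS", "Hack", "SQL"]
)
rows = 1

# Pitfall: head() is taken from the sorted copy, tail() from the unsorted original.
mixed = pd.concat(
    [totals.sort_values("total", ascending=False).head(rows), totals.tail(rows)]
)

# Sort once, keep the result, then slice both ends of the same object.
ordered = totals.sort_values("total", ascending=False)
consistent = pd.concat([ordered.head(rows), ordered.tail(rows)])

print(mixed.index.tolist())
print(consistent.index.tolist())
```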


**Note**: The iteration example above can give wrong totals when the first column doesn't contain every value, because each later column is reindexed to the first column's labels and any other label is dropped. It can be replaced with the example below:

In [24]:
df_comb = pd.DataFrame()
temp = []
val_count_tmp = pd.Series(dtype=float)

# sum all columns in dataframe with iteration
for col in df_lang.columns:
    temp.append(df_lang[col].fillna(0).value_counts())

for val_count in temp:
    val_count_tmp = val_count_tmp.add(val_count,fill_value=0)

y = val_count_tmp.dropna().drop(0)  # drop the placeholder label 0 created by fillna(0)
y.sort_values(ascending=False, inplace=True)
pd.concat([y.head(10), y.tail(10)])

JavaScript              54686.0
HTML                    53628.0
CSS                     50979.0
SQL                     44670.0
Java                    35521.0
Bash/Shell              31172.0
Python                  30359.0
C#                      26954.0
PHP                     24071.0
C++                     19872.0
Delphi/Object Pascal     2025.0
Haskell                  1961.0
Rust                     1857.0
F#                       1115.0
Clojure                  1032.0
Erlang                    886.0
Cobol                     590.0
Ocaml                     470.0
Julia                     430.0
Hack                      254.0
dtype: float64
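
If the loop is not needed for other reasons, the same totals can probably be obtained without iteration by reshaping the expanded frame into one long column and counting that. The sketch below uses `DataFrame.melt` on a hypothetical three-row frame named `wide`; it is an alternative approach, not the one used in this notebook.

```python
import pandas as pd

# Hypothetical expanded frame, as produced by str.split(';', expand=True).
wide = pd.DataFrame(
    [
        ["JavaScript", "Python", None],
        ["Python", None, None],
        ["C", "JavaScript", "Python"],
    ]
)

# melt() unpivots every column into a single "value" column;
# value_counts() ignores the missing cells by default.
totals = wide.melt()["value"].value_counts()

print(totals)
```

Since the missing cells are never filled with `0`, there is no placeholder label to drop at the end.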