<html>
    <div style="color:#363945; background-color:#E1F6FF; display: block">
        <h1> Data Exploration - Notebook Summary </h1>
            <ol>
                <li> Read the raw data collected from the <strong>2020</strong> Stack Overflow Developer Survey. [ <a href = "https://insights.stackoverflow.com/survey" >StackOverflow Surveys</a> ]</li>
                <li> Discover the <strong>shape</strong> of the data (number of answers in the survey).</li>
                <li> Display a <strong>random answer</strong> to explore its different attributes.</li>
                <li> Explore the <strong>format</strong> of different attributes.</li>
                <li> Explore some <strong>simple stats</strong> in the dataset.</li>
            </ol><br>
    </div>
</html>

### Loading Data

In [1]:
DATA_PATH = "../data/raw/survey_results_public.csv"

In [2]:
# Import libraries
import numpy as np
import pandas as pd
pd.options.display.max_rows = 1000 

In [3]:
# Loading data and printing shape
raw_df = pd.read_csv(DATA_PATH)
raw_df.shape

(64461, 61)

In [4]:
# Display a random answer
# Observation: fields holding multiple answers need to be split
# The survey schema is needed to interpret the columns
raw_df.sample(1).iloc[0]

Respondent                                                                  26721
MainBranch                                 I am a student who is learning to code
Hobbyist                                                                      Yes
Age                                                                          17.0
Age1stCode                                                                      8
CompFreq                                                                      NaN
CompTotal                                                                     NaN
ConvertedComp                                                                 NaN
Country                                                             United States
CurrencyDesc                                                                  NaN
CurrencySymbol                                                                NaN
DatabaseDesireNextYear                                                        NaN
DatabaseWorkedWi
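
The observation above about multi-answer fields can be sketched as follows. This is a minimal illustration, not part of the original notebook: it assumes answers are separated by `;` (as seen in the `Ethnicity` values further down), uses a small made-up frame instead of the real CSV, and the helper name `split_multi_answers` is hypothetical.

```python
import pandas as pd

# Toy stand-in for raw_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Respondent": [1, 2, 3],
    "DatabaseWorkedWith": ["MySQL;PostgreSQL", "SQLite", None],
})


def split_multi_answers(df, column, sep=";"):
    """Return one row per (respondent, answer) pair for a multi-answer column.

    Hypothetical helper: assumes answers are joined with `sep`.
    """
    out = df[["Respondent", column]].dropna(subset=[column]).copy()
    out[column] = out[column].str.split(sep)  # string -> list of answers
    return out.explode(column).reset_index(drop=True)


long_df = split_multi_answers(toy_df, "DatabaseWorkedWith")
answer_counts = long_df["DatabaseWorkedWith"].value_counts()
print(answer_counts)
```

Counting on the exploded frame gives one tally per individual answer rather than one per unique combination of answers.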

In [5]:
# Print general information about the data
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 61 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Respondent                    64461 non-null  int64  
 1   MainBranch                    64162 non-null  object 
 2   Hobbyist                      64416 non-null  object 
 3   Age                           45446 non-null  float64
 4   Age1stCode                    57900 non-null  object 
 5   CompFreq                      40069 non-null  object 
 6   CompTotal                     34826 non-null  float64
 7   ConvertedComp                 34756 non-null  float64
 8   Country                       64072 non-null  object 
 9   CurrencyDesc                  45472 non-null  object 
 10  CurrencySymbol                45472 non-null  object 
 11  DatabaseDesireNextYear        44070 non-null  object 
 12  DatabaseWorkedWith            49537 non-null  object 
 13  D

In [6]:
# Get summary statistics for the numerical columns
raw_df.describe()

statistic,Respondent,Age,CompTotal,ConvertedComp,WorkWeekHrs
count,64461.0,45446.0,34826.0,34756.0,41151.0
mean,32554.079738,30.834111,3.190464e+242,103756.1,40.782174
std,18967.44236,9.585392,inf,226885.3,17.816383
min,1.0,1.0,0.0,0.0,1.0
25%,16116.0,24.0,20000.0,24648.0,40.0
50%,32231.0,29.0,63000.0,54049.0,40.0
75%,49142.0,35.0,125000.0,95000.0,44.0
max,65639.0,279.0,1.1111110000000001e+247,2000000.0,475.0
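
The summary above contains some implausible extremes (for example a maximum `Age` of 279 and a `CompTotal` mean around 3e+242), so the mean can be misleading. The sketch below shows one possible way to mask out-of-range values before summarising. The bounds `MIN_AGE` and `MAX_AGE` are assumptions chosen for illustration, and the data is a made-up sample rather than the real survey.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_df["Age"]; the values are invented for illustration only.
toy_df = pd.DataFrame({"Age": [17.0, 29.0, 35.0, 279.0, np.nan, 1.0]})

# Hypothetical plausibility bounds; pick values that suit the analysis.
MIN_AGE, MAX_AGE = 10, 100

# between() is inclusive and returns False for NaN.
plausible = toy_df["Age"].between(MIN_AGE, MAX_AGE)

# Values outside the bounds become NaN instead of being dropped,
# so the row count of the frame is preserved.
clean_age = toy_df["Age"].where(plausible)

# The median is less sensitive to any remaining outliers than the mean.
median_age = clean_age.median()
print(median_age)
```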


In [7]:
# Three columns have an 'object' dtype despite holding integer-like values
# Inspect the unique values of these questionable columns
questionable_cols = ['Age1stCode', 'YearsCode', 'YearsCodePro']

for col in questionable_cols:
    print(col)
    print(raw_df[col].unique().tolist())
    print("--------------------------\n")

Age1stCode
['13', '19', '15', '18', '16', '14', '12', '20', '42', '8', '25', '22', '30', '17', '21', '10', '46', '9', '7', '11', '6', nan, '31', '29', '5', 'Younger than 5 years', '28', '38', '23', '27', '41', '24', '53', '26', '35', '32', '40', '33', '36', '54', '48', '56', '45', '44', '34', 'Older than 85', '39', '51', '68', '50', '37', '47', '43', '52', '85', '64', '55', '58', '49', '76', '72', '73', '83', '63']
--------------------------

YearsCode
['36', '7', '4', '15', '6', '17', '8', '10', '35', '5', '37', '19', '9', '22', '30', '23', '20', '2', 'Less than 1 year', '3', '13', '25', '16', '43', '11', '38', '33', nan, '24', '21', '12', '40', '27', '50', '46', '14', '18', '28', '32', '44', '26', '42', '31', '34', '29', '1', '39', '41', '45', 'More than 50 years', '47', '49', '48']
--------------------------

YearsCodePro
['27', '4', nan, '8', '13', '2', '7', '20', '1', '23', '3', '12', '17', '18', '10', '14', '29', '6', '28', '9', '15', '11', '16', '25', 'Less than 1 year', '5', '2
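
The unique values above show why these columns are stored as `object`: a few text categories sit among the numeric strings. One possible way to make them numeric is sketched below. The numeric stand-ins in `TEXT_TO_NUMBER` are assumptions (any boundary value is a judgment call), the helper name `to_numeric_years` is hypothetical, and the sample series is made up.

```python
import pandas as pd

# Toy stand-in for one of the questionable columns; values invented for illustration.
toy_col = pd.Series(
    ["13", "Younger than 5 years", None, "Older than 85", "8"],
    name="Age1stCode",
)

# Assumed numeric replacements for the text categories seen in the output above.
TEXT_TO_NUMBER = {
    "Younger than 5 years": 4,
    "Older than 85": 86,
    "Less than 1 year": 0,
    "More than 50 years": 51,
}


def to_numeric_years(series):
    """Map known text categories to numbers, then cast everything to numeric.

    Anything that still cannot be parsed becomes NaN (errors="coerce").
    """
    mapped = series.map(lambda value: TEXT_TO_NUMBER.get(value, value))
    return pd.to_numeric(mapped, errors="coerce")


numeric_col = to_numeric_years(toy_col)
print(numeric_col)
```

Applied to the real data, the same helper could be looped over `questionable_cols`, after which `describe()` would include these columns as well.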

# Some Simple Stats in the data

### What is the total number of respondents?

In [8]:
print("There are \033[1m{}\033[0m Respondents in this survey".format(raw_df.shape[0]))

There are 64461 Respondents in this survey


### What are the top 10 countries involved in the survey?

In [9]:
raw_df['Country'].value_counts().head(10)

United States     12469
India              8403
United Kingdom     3896
Germany            3890
Canada             2191
France             1898
Brazil             1818
Netherlands        1343
Poland             1278
Australia          1208
Name: Country, dtype: int64
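
Raw counts are easier to compare as shares of all respondents who named a country. A minimal sketch, using a made-up sample instead of the real column, relies on the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Toy stand-in for raw_df["Country"]; values invented for illustration only.
toy_country = pd.Series(
    ["United States", "India", "India", None, "Germany"],
    name="Country",
)

# normalize=True returns fractions of the non-null answers
# (missing values are excluded by default).
shares = toy_country.value_counts(normalize=True).mul(100).round(1)
print(shares)
```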

### What is the ethnicity of the respondents?

In [10]:
raw_df['Ethnicity'].value_counts().head(10)

White or of European descent                           29318
South Asian                                             4467
Hispanic or Latino/a/x                                  2256
Black or of African descent                             1690
Southeast Asian                                         1686
East Asian                                              1681
Middle Eastern                                          1622
Hispanic or Latino/a/x;White or of European descent      763
Middle Eastern;White or of European descent              378
Multiracial                                              292
Name: Ethnicity, dtype: int64

### What is the mean age of programmers?

In [11]:
print("The average age of programmers in the survey is \033[1m{:.1f}\033[0m years.".format(raw_df['Age'].mean()))

The average age of programmers in the survey is 30.8 years.
