### Prequels/sequels

- **Kaggle Machine Learning & Data Science (data-prep)** | [Extended Dataset](https://www.kaggle.com/neomatrix369/kaggle-machine-learning-data-science-survey-ext) | [Additional Dataset](https://www.kaggle.com/neomatrix369/world-bank-data-1960-to-2016-extended)
- [Kaggle Global Outreach (analysis)](https://www.kaggle.com/neomatrix369/kaggle-global-outreach-analysis/)

## Installing and importing libraries and packages

In [None]:
!pip install -U missingno

In [None]:
!pip install fuzzywuzzy

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numexpr
import math

import warnings
warnings.filterwarnings('ignore')

import missingno as msno

import gc

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", font_scale=1.75)

# prettify plots
plt.rcParams['figure.figsize'] = [20.0, 5.0]
sns.set_palette(sns.color_palette("muted"))

%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    print(dirname)
    for filename in filenames:
        print(os.path.join(dirname, filename))


# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Loading datasets

In [None]:
def strip_spaces_from_all_columns(dataframe: pd.DataFrame) -> pd.DataFrame:
    string_type_cols = dataframe.columns[
        (dataframe.dtypes == 'object') | 
        (dataframe.dtypes == 'str')
    ]
    
    for each_column in string_type_cols:
        dataframe[each_column] = dataframe[each_column].apply(lambda x: x.strip() if isinstance(x, str) else x)

    category_type_cols = dataframe.columns[dataframe.dtypes == 'category']
    for each_column in category_type_cols:
        dataframe[each_column] = dataframe[each_column].astype('str')
        dataframe[each_column] = dataframe[each_column].apply(lambda x: x.strip())
        dataframe[each_column] = dataframe[each_column].astype('category')
    
    if (len(string_type_cols) > 0) or (len(category_type_cols) > 0):
        print(f'Stripped leading and trailing spaces from these columns: {string_type_cols}, {category_type_cols}')

    return dataframe

In [None]:
### Reference: https://hackersandslackers.com/compare-rows-pandas-dataframes/
def dataframe_difference(first: pd.DataFrame, other: pd.DataFrame, which=None):
    """Find rows which are different between two DataFrames."""
    comparison_df = first.merge(
        other,
        indicator=True,
        how='outer'  # keep rows found on only one side so '_merge' can flag them
    )
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]
    return diff_df
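
To see what the `_merge` indicator column reports, here is a minimal sketch on two toy frames. The frames and their `id`/`value` columns are made up for illustration, and the merge assumes `how='outer'` so that rows present on only one side are retained.

```python
import pandas as pd

# Hypothetical toy frames, for illustration only.
left = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'value': ['b', 'x', 'd']})

# Merge on all shared columns; indicator=True adds a '_merge' column
# holding 'left_only', 'right_only' or 'both' for every row.
merged = left.merge(right, how='outer', indicator=True)

only_left = merged[merged['_merge'] == 'left_only']
only_right = merged[merged['_merge'] == 'right_only']
in_both = merged[merged['_merge'] == 'both']

print(merged['_merge'].value_counts())
```

A row counts as `both` only when every shared column matches, so a single differing cell is enough to split a record into one `left_only` and one `right_only` row.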

In [None]:
DATASET_UPLOAD_FOLDER='/kaggle/working/upload'
PREPROCESSED_DATASET_UPLOAD_FOLDER='/kaggle/working/upload/preprocessed-kaggle-2017-to-2020'
NOT_AVAILABLE = "Unknown / Not Specified"

In [None]:
%%bash
UPLOAD_FOLDER="/kaggle/working/upload"
PREPROCESSED_UPLOAD_FOLDER='/kaggle/working/upload/preprocessed-kaggle-2017-to-2020'
mkdir -p ${UPLOAD_FOLDER} ${PREPROCESSED_UPLOAD_FOLDER}
cp -fr /kaggle/input/stack-overflow-developer-survey-2020/* ${UPLOAD_FOLDER} || true

folders=( "kaggle-survey-2020" "kaggle-survey-2019" "kaggle-survey-2018" "kaggle-survey-2017" "so-survey-2017" "stack-overflow-developer-survey-2020" 
"cleaned-mcr-kaggle-survey-2019" "stack-overflow-2018-developer-survey" "stack-overflow-developer-survey-results-2019"
"kaggle-survey-20172020-merged-data" "world-development-indicators" )

for each_folder in "${folders[@]}"
do
    echo "~~~ Zipping files in ${each_folder}"
    cd "/kaggle/input/${each_folder}" || continue
    zip -r9 "${UPLOAD_FOLDER}/${each_folder}.zip" *
    echo "~~~ ${each_folder}.zip ready to be used/copied."
    echo ""
    cd ..
done

In [None]:
def check_missing_values(dataframe):
    missing_values_data = []
    for column in list(dataframe.columns):
        # Number of missing (NaN/None) entries in this column
        value = int(dataframe[column].isna().sum())

        missing_values_data.append({'column_name': column, 'missing_values_count': value})
        
    df_missing_values = pd.DataFrame(missing_values_data)
    return df_missing_values.sort_values(by='missing_values_count', ascending=False)
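
As a cross-check, the per-column count of missing values can also be obtained directly from `isna().sum()`. The sketch below uses a made-up frame; `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
toy = pd.DataFrame({
    'Country': ['India', None, 'Brazil', None],
    'Age': [25.0, np.nan, 31.0, 40.0],
    'Year': [2017, 2018, 2019, 2020],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = (
    toy.isna().sum()
    .rename('missing_values_count')
    .rename_axis('column_name')
    .reset_index()
    .sort_values(by='missing_values_count', ascending=False)
)
print(missing_counts)
```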

### Preprocessing merged dataset from @harveenchadha (from 2017 to 2020)

In [None]:
kaggle_questions_and_responses_ref_df = pd.read_csv('../input/kaggle-survey-20172020-merged-data/kaggle_survey_17_20_v2.csv')
print(f"Before kaggle_questions_and_responses_ref_df: {kaggle_questions_and_responses_ref_df.shape[0]}")
kaggle_questions_and_responses_ref_df = kaggle_questions_and_responses_ref_df.fillna(NOT_AVAILABLE)
if 'index' in kaggle_questions_and_responses_ref_df.columns:
    kaggle_questions_and_responses_ref_df = kaggle_questions_and_responses_ref_df.drop('index', axis=1)
final_rows_count = kaggle_questions_and_responses_ref_df.shape[0]
print(f"After kaggle_questions_and_responses_ref_df: {final_rows_count}")
rows_count_after_dropping_duplicates = kaggle_questions_and_responses_ref_df.drop_duplicates(keep='first').shape[0]
print('if duplicates dropped in kaggle_questions_and_responses_ref_df: ' \
      f"{rows_count_after_dropping_duplicates} (difference: {final_rows_count - rows_count_after_dropping_duplicates})")
kaggle_questions_and_responses_ref_df.head()

### Picking a handful more columns missed in the above merge process

In [None]:
def load_dataset(filepath: str, column_mappings: dict, encoding='utf-8') -> pd.DataFrame:
    dataset = pd.read_csv(filepath, encoding=encoding)
    dataset = dataset.rename(columns=column_mappings)
    dataset = dataset.drop([0], errors='ignore')
    columns_found = list(set(column_mappings.values()) & set(dataset.columns))
    columns_missing = list(set(column_mappings.values()) - set(columns_found))
    if columns_missing:
        dataset[columns_missing] = NOT_AVAILABLE
        print(f'columns missing (replaced with "{NOT_AVAILABLE}" values):', columns_missing)
    return dataset[list(column_mappings.values())]
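
The steps inside `load_dataset` (rename, drop the first row, add placeholder columns) can be tried on a tiny in-memory file. This is only a sketch: the CSV content and the `Q99` mapping are invented, and the first data row is assumed to hold the question text, as the `drop([0])` call suggests.

```python
import io
import pandas as pd

NOT_AVAILABLE = "Unknown / Not Specified"

# Hypothetical miniature survey file; row 0 repeats the question text.
raw_csv = io.StringIO(
    "Q1,Q2\n"
    "What is your age?,What is your gender?\n"
    "25-29,Man\n"
    "30-34,Woman\n"
)
mappings = {'Q1': 'Age', 'Q2': 'Gender', 'Q99': 'Country'}  # 'Q99' is absent on purpose

dataset = pd.read_csv(raw_csv).rename(columns=mappings)
dataset = dataset.drop([0], errors='ignore')  # drop the question-text row

wanted = list(mappings.values())
missing = [name for name in wanted if name not in dataset.columns]
for name in missing:
    dataset[name] = NOT_AVAILABLE  # placeholder column for a question not asked

dataset = dataset[wanted]
print(dataset)
```

Adding placeholder columns keeps every yearly frame on the same schema, which is what lets a later `pd.concat` line the years up column by column.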

### Kaggle Survey 2017

In [None]:
question_id_to_human_readable_2017 = {
    'Age':'Age',
    'GenderSelect':'Gender',
    'Country':'Country',
    'FormalEducation':'Degree',
    'CurrentJobTitleSelect':'Job Title',
    'EmployerSize':'Company Size',
    #'EmployerSizeChange':'Team Size',
    #'EmployerMLTime':'ML Status in Company',
    'CompensationAmount':'Compensation Status',
    'N/A1' : 'Current role experience (in years)',
    'N/A2': 'Role Important at work',
    'N/A11': 'Programming language choice',
    'LanguageRecommendationSelect': 'Recommend Programming language', 
    'N/A12': 'Coding experience (in years)', ## CodeWriter field does not have the right data for this year
    'N/A3': 'Notebook product', 
    'N/A4': 'Computing platform', 
    'N/A5': '% of current ML/DS training categories',
    'HardwarePersonalProjectsSelect': 'Specialised HW',
    'N/A6': 'TPU Usage',      # not available in 2017
    'N/A7': 'ML Methods experience (in years)',
    'N/A8': 'Tools to manage ML experiments', # not available in 2017
    'N/A9': 'Completed DS courses', # also check CoursePlatformSelect
    'N/A10':'Favourite media sources'
}
kaggle_2017 = load_dataset('../input/kaggle-survey-2017/multipleChoiceResponses.csv', question_id_to_human_readable_2017, encoding='latin-1')
kaggle_2017['Year'] = 2017
kaggle_2017.head()

### Kaggle Survey 2018

In [None]:
question_id_to_human_readable_2018 = {
    'Time from Start to Finish (seconds)':'Time',
    'Q2':'Age',
    'Q1':'Gender',
    'Q3':'Country',
    'Q4':'Degree',
    'Q6':'Job Title',
    # 'Q8':'Team Size', ## wrong
    # 'Q9':'Company Size', ## wrong                       
    'Q10':'ML Status in Company', 
    'Q9':'Compensation Status',
    # 'Q11':'Money Spent' ## wrong    
    'Q8' : 'Current role experience (in years)',
    'Q11_Part_1': 'Role Important at work 1',
    'Q11_Part_2': 'Role Important at work 2',
    'Q11_Part_3': 'Role Important at work 3',
    'Q11_Part_4': 'Role Important at work 4',
    'Q11_Part_5': 'Role Important at work 5',
    'Q11_Part_6': 'Role Important at work 6',
    'Q11_Part_7': 'Role Important at work 7',
    'Q11_OTHER_TEXT': 'Role Important at work 8',
    'Q17': 'Programming language choice', 
#     'N/A1': 'Recommend Programming language', 
    'Q14_Part_1': 'Notebook product 1', 
    'Q14_Part_2': 'Notebook product 2', 
    'Q14_Part_3': 'Notebook product 3',     
    'Q14_Part_4': 'Notebook product 4', 
    'Q14_Part_5': 'Notebook product 5', 
    'Q14_Part_6': 'Notebook product 6',     
    'Q14_Part_7': 'Notebook product 7', 
    'Q14_Part_8': 'Notebook product 8', 
    'Q14_Part_9': 'Notebook product 9',     
    'Q14_Part_10': 'Notebook product 10', 
    'Q14_Part_11': 'Notebook product 11',     
    'Q14_OTHER_TEXT': 'Notebook product 12',
    'Q15_Part_1': 'Cloud Computing platform 1', 
    'Q15_Part_2': 'Cloud Computing platform 2', 
    'Q15_Part_3': 'Cloud Computing platform 3', 
    'Q15_Part_4': 'Cloud Computing platform 4', 
    'Q15_Part_5': 'Cloud Computing platform 5', 
    'Q15_Part_6': 'Cloud Computing platform 6', 
    'Q15_Part_7': 'Cloud Computing platform 7',     
    'Q15_OTHER_TEXT': 'Computing platform 8',  
    'Q35_Part_1': '% of current ML/DS training categories 1',
    'Q35_Part_2': '% of current ML/DS training categories 2',
    'Q35_Part_3': '% of current ML/DS training categories 3',    
    'Q35_Part_4': '% of current ML/DS training categories 4',
    'Q35_Part_5': '% of current ML/DS training categories 5',
    'Q35_Part_6': '% of current ML/DS training categories 6',    
    'Q35_OTHER_TEXT': '% of current ML/DS training categories 7',
#     'N/A1': 'Specialised HW', # not available in 2018
#     'N/A2': 'TPU Usage',      # not available in 2018
#     'N/A3': 'ML Methods experience (in years)',  # not available in 2018
#     'N/A4': 'Tools to manage ML experiments', # not available  in 2018
    'Q36_Part_1': 'Completed DS courses 1',
    'Q36_Part_2': 'Completed DS courses 2',
    'Q36_Part_3': 'Completed DS courses 3',    
    'Q36_Part_4': 'Completed DS courses 4',
    'Q36_Part_5': 'Completed DS courses 5',
    'Q36_Part_6': 'Completed DS courses 6',    
    'Q36_Part_7': 'Completed DS courses 7',
    'Q36_Part_8': 'Completed DS courses 8',
    'Q36_Part_9': 'Completed DS courses 9',    
    'Q36_Part_10': 'Completed DS courses 10',
    'Q36_Part_11': 'Completed DS courses 11',
    'Q36_Part_12': 'Completed DS courses 12',    
    'Q36_Part_13': 'Completed DS courses 13',  
    'Q36_OTHER_TEXT': 'Completed DS courses 14',
    'Q38_Part_1':'Favourite media sources 1',
    'Q38_Part_2':'Favourite media sources 2',
    'Q38_Part_3':'Favourite media sources 3',
    'Q38_Part_4':'Favourite media sources 4',
    'Q38_Part_5':'Favourite media sources 5',    
    'Q38_Part_6':'Favourite media sources 6',
    'Q38_Part_7':'Favourite media sources 7',
    'Q38_Part_8':'Favourite media sources 8',
    'Q38_Part_9':'Favourite media sources 9',
    'Q38_Part_10':'Favourite media sources 10',
    'Q38_Part_11':'Favourite media sources 11',
    'Q38_Part_12':'Favourite media sources 12',
    'Q38_Part_13':'Favourite media sources 13',
    'Q38_Part_14':'Favourite media sources 14',
    'Q38_Part_15':'Favourite media sources 15',
    'Q38_Part_16':'Favourite media sources 16',
    'Q38_Part_17':'Favourite media sources 17',
    'Q38_Part_18':'Favourite media sources 18',    
    'Q38_Part_19':'Favourite media sources 19',        
    'Q38_Part_20':'Favourite media sources 20',        
    'Q38_Part_21':'Favourite media sources 21',
    'Q38_Part_22':'Favourite media sources 22',
    'Q38_OTHER_TEXT': 'Favourite media sources 23'
}
kaggle_2018 = load_dataset('../input/kaggle-survey-2018/multipleChoiceResponses.csv', question_id_to_human_readable_2018)
kaggle_2018['Year'] = 2018
kaggle_2018.head()

### Kaggle Survey 2019

In [None]:
question_id_to_human_readable_2019 = {
    'Time from Start to Finish (seconds)':'Time',
    'Q1':'Age',
    'Q2':'Gender',
    'Q3':'Country',
    'Q4':'Degree',
    'Q5':'Job Title',
    'Q6':'Team Size', ## wrong
#     'Q9':'Company Size', ## wrong                       
    'Q10':'Compensation Status',
    'Q8':'ML Status in Company',     
    # 'Q11':'Money Spent' ## wrong        
#     'N/A1' : 'Current role experience (in years)',
    'Q9_Part_1': 'Role Important at work 1',
    'Q9_Part_2': 'Role Important at work 2',
    'Q9_Part_3': 'Role Important at work 3',
    'Q9_Part_4': 'Role Important at work 4',
    'Q9_Part_5': 'Role Important at work 5',
    'Q9_Part_6': 'Role Important at work 6',
    'Q9_Part_7': 'Role Important at work 7',
    'Q9_OTHER_TEXT': 'Role Important at work 8',
    'Q18_Part_1': 'Programming language choice 1', 
    'Q18_Part_2': 'Programming language choice 2',     
    'Q18_Part_3': 'Programming language choice 3', 
    'Q18_Part_4': 'Programming language choice 4', 
    'Q18_Part_5': 'Programming language choice 5', 
    'Q18_Part_6': 'Programming language choice 6', 
    'Q18_Part_7': 'Programming language choice 7', 
    'Q18_Part_8': 'Programming language choice 8',   
    'Q18_Part_9': 'Programming language choice 9',   
    'Q18_Part_10': 'Programming language choice 10',   
    'Q18_Part_11': 'Programming language choice 11',   
    'Q18_Part_12': 'Programming language choice 12',       
    'Q18_OTHER_TEXT': 'Programming language choice 13',
    'Q19': 'Recommend Programming language', 
    'Q17_Part_1': 'Notebook product 1', 
    'Q17_Part_2': 'Notebook product 2', 
    'Q17_Part_3': 'Notebook product 3',     
    'Q17_Part_4': 'Notebook product 4', 
    'Q17_Part_5': 'Notebook product 5', 
    'Q17_Part_6': 'Notebook product 6',     
    'Q17_Part_7': 'Notebook product 7', 
    'Q17_Part_8': 'Notebook product 8', 
    'Q17_Part_9': 'Notebook product 9',     
    'Q17_Part_10': 'Notebook product 10', 
    'Q17_Part_11': 'Notebook product 11',    
    'Q17_Part_12': 'Notebook product 12',    
    'Q17_OTHER_TEXT': 'Notebook product 13',
    'Q29_Part_1': 'Cloud Computing platform 1', 
    'Q29_Part_2': 'Cloud Computing platform 2', 
    'Q29_Part_3': 'Cloud Computing platform 3', 
    'Q29_Part_4': 'Cloud Computing platform 4', 
    'Q29_Part_5': 'Cloud Computing platform 5', 
    'Q29_Part_6': 'Cloud Computing platform 6', 
    'Q29_Part_7': 'Cloud Computing platform 7',   
    'Q29_Part_8': 'Cloud Computing platform 8',   
    'Q29_Part_9': 'Cloud Computing platform 9',   
    'Q29_Part_10': 'Cloud Computing platform 10',   
    'Q29_Part_11': 'Cloud Computing platform 11',  
    'Q29_Part_12': 'Cloud Computing platform 12',    
    'Q29_OTHER_TEXT': 'Computing platform 13',  
#     'N/A2': '% of current ML/DS training categories 1',
#     'N/A3': '% of current ML/DS training categories 2',
#     'N/A4': '% of current ML/DS training categories 3',    
#     'N/A5': '% of current ML/DS training categories 4',
#     'N/A6': '% of current ML/DS training categories 5',
#     'N/A7': '% of current ML/DS training categories 6',    
#     'N/A8': '% of current ML/DS training categories 7',
    'Q21_Part_1': 'Specialised HW 1',
    'Q21_Part_2': 'Specialised HW 2',
    'Q21_Part_3': 'Specialised HW 3',
    'Q21_Part_4': 'Specialised HW 4',
    'Q21_Part_5': 'Specialised HW 5',
    'Q21_Part_6': 'Specialised HW 6',
    'Q21_OTHER_TEXT': 'Specialised HW 7',    
    'Q22': 'TPU Usage',      
    'Q23': 'ML Methods experience (in years)', 
#     'N/A9': 'Tools to manage ML experiments', 
    'Q13_Part_1': 'Completed DS courses 1',
    'Q13_Part_2': 'Completed DS courses 2',
    'Q13_Part_3': 'Completed DS courses 3',    
    'Q13_Part_4': 'Completed DS courses 4',
    'Q13_Part_5': 'Completed DS courses 5',
    'Q13_Part_6': 'Completed DS courses 6',    
    'Q13_Part_7': 'Completed DS courses 7',
    'Q13_Part_8': 'Completed DS courses 8',
    'Q13_Part_9': 'Completed DS courses 9',    
    'Q13_Part_10': 'Completed DS courses 10',
    'Q13_Part_11': 'Completed DS courses 11',
    'Q13_Part_12': 'Completed DS courses 12',    
    'Q13_Part_13': 'Completed DS courses 13',  
    'Q13_OTHER_TEXT': 'Completed DS courses 14',
    'Q12_Part_1':'Favourite media sources 1',
    'Q12_Part_2':'Favourite media sources 2',
    'Q12_Part_3':'Favourite media sources 3',
    'Q12_Part_4':'Favourite media sources 4',
    'Q12_Part_5':'Favourite media sources 5',    
    'Q12_Part_6':'Favourite media sources 6',
    'Q12_Part_7':'Favourite media sources 7',
    'Q12_Part_8':'Favourite media sources 8',
    'Q12_Part_9':'Favourite media sources 9',
    'Q12_Part_10':'Favourite media sources 10',
    'Q12_Part_11':'Favourite media sources 11',
    'Q12_Part_12':'Favourite media sources 12',
    'Q12_OTHER_TEXT': 'Favourite media sources 13'
}
kaggle_2019 = load_dataset('../input/kaggle-survey-2019/multiple_choice_responses.csv', question_id_to_human_readable_2019)
kaggle_2019['Year'] = 2019
kaggle_2019.head()

### Kaggle Survey 2020

In [None]:
question_id_to_human_readable_2020 = {
    'Time from Start to Finish (seconds)':'Time',
    'Q1':'Age',
    'Q2':'Gender',
    'Q3':'Country',
    'Q4':'Degree',
    'Q5':'Job Title',
    'Q20':'Company Size',
    'Q21':'Team Size',
    'Q22':'ML Status in Company',
    'Q24':'Compensation Status',
    'Q25':'Money Spent',
    'Q6': 'Coding experience (in years)', 
    'Q7_Part_1': 'Programming language choice 1', 
    'Q7_Part_2': 'Programming language choice 2', 
    'Q7_Part_3': 'Programming language choice 3', 
    'Q7_Part_4': 'Programming language choice 4', 
    'Q7_Part_5': 'Programming language choice 5', 
    'Q7_Part_6': 'Programming language choice 6',     
    'Q7_Part_7': 'Programming language choice 7',     
    'Q7_Part_8': 'Programming language choice 8',    
    'Q7_Part_9': 'Programming language choice 9', 
    'Q7_Part_10': 'Programming language choice 10', 
    'Q7_Part_11': 'Programming language choice 11', 
    'Q7_Part_12': 'Programming language choice 12',     
    'Q7_OTHER': 'Programming language choice 13', 
    'Q8': 'Recommend Programming language', 
    'Q10_Part_1': 'Notebook product 1', 
    'Q10_Part_2': 'Notebook product 2', 
    'Q10_Part_3': 'Notebook product 3',     
    'Q10_Part_4': 'Notebook product 4', 
    'Q10_Part_5': 'Notebook product 5', 
    'Q10_Part_6': 'Notebook product 6',     
    'Q10_Part_7': 'Notebook product 7', 
    'Q10_Part_8': 'Notebook product 8', 
    'Q10_Part_9': 'Notebook product 9',     
    'Q10_Part_10': 'Notebook product 10', 
    'Q10_Part_11': 'Notebook product 11',    
    'Q10_Part_12': 'Notebook product 12',    
    'Q10_Part_13': 'Notebook product 13',        
    'Q10_OTHER': 'Notebook product 14',
    'Q26_A_Part_1': 'Cloud Computing platform 1', 
    'Q26_A_Part_2': 'Cloud Computing platform 2', 
    'Q26_A_Part_3': 'Cloud Computing platform 3',     
    'Q26_A_Part_4': 'Cloud Computing platform 4',    
    'Q26_A_Part_5': 'Cloud Computing platform 5',    
    'Q26_A_Part_6': 'Cloud Computing platform 6',    
    'Q26_A_Part_7': 'Cloud Computing platform 7',
    'Q26_A_Part_8': 'Cloud Computing platform 8',    
    'Q26_A_Part_9': 'Cloud Computing platform 9',    
    'Q26_A_Part_10': 'Cloud Computing platform 10',    
    'Q26_A_Part_11': 'Cloud Computing platform 11',
    'Q26_A_OTHER': 'Cloud Computing platform 12',
    'Q12_Part_1': 'Specialised HW 1', 
    'Q12_Part_2': 'Specialised HW 2', 
    'Q12_Part_3': 'Specialised HW 3', 
    'Q12_OTHER': 'Specialised HW 4',     
    'Q13': 'TPU Usage', 
    'Q15': 'ML Methods experience (in years)', 
    'Q23_Part_1': 'Role Important at work 1', 
    'Q23_Part_2': 'Role Important at work 2', 
    'Q23_Part_3': 'Role Important at work 3', 
    'Q23_Part_4': 'Role Important at work 4',
    'Q23_Part_5': 'Role Important at work 5',
    'Q23_Part_6': 'Role Important at work 6',
    'Q23_Part_7': 'Role Important at work 7',
    'Q23_OTHER': 'Role Important at work 8',
    'Q35_A_Part_1': 'Tools to manage ML experiments 1',
    'Q35_A_Part_2': 'Tools to manage ML experiments 2',  
    'Q35_A_Part_3': 'Tools to manage ML experiments 3',   
    'Q35_A_Part_4': 'Tools to manage ML experiments 4',  
    'Q35_A_Part_5': 'Tools to manage ML experiments 5',  
    'Q35_A_Part_6': 'Tools to manage ML experiments 6',  
    'Q35_A_Part_7': 'Tools to manage ML experiments 7',  
    'Q35_A_Part_8': 'Tools to manage ML experiments 8',
    'Q35_A_Part_9': 'Tools to manage ML experiments 9',  
    'Q35_A_Part_10': 'Tools to manage ML experiments 10',      
    'Q35_A_OTHER': 'Tools to manage ML experiments 11',  
    'Q37_Part_1': 'Completed DS courses 1',
    'Q37_Part_2': 'Completed DS courses 2',
    'Q37_Part_3': 'Completed DS courses 3',    
    'Q37_Part_4': 'Completed DS courses 4',
    'Q37_Part_5': 'Completed DS courses 5',
    'Q37_Part_6': 'Completed DS courses 6',    
    'Q37_Part_7': 'Completed DS courses 7',
    'Q37_Part_8': 'Completed DS courses 8',
    'Q37_Part_9': 'Completed DS courses 9',    
    'Q37_Part_10': 'Completed DS courses 10',
    'Q37_Part_11': 'Completed DS courses 11',
    'Q37_OTHER': 'Completed DS courses 12', 
    'Q39_Part_1':'Favourite media sources 1',
    'Q39_Part_2':'Favourite media sources 2',
    'Q39_Part_3':'Favourite media sources 3',
    'Q39_Part_4':'Favourite media sources 4',
    'Q39_Part_5':'Favourite media sources 5',    
    'Q39_Part_6':'Favourite media sources 6',
    'Q39_Part_7':'Favourite media sources 7',
    'Q39_Part_8':'Favourite media sources 8',
    'Q39_Part_9':'Favourite media sources 9',
    'Q39_Part_10':'Favourite media sources 10',
    'Q39_Part_11':'Favourite media sources 11',
    'Q39_OTHER': 'Favourite media sources 12'
}

kaggle_2020 = load_dataset('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv', question_id_to_human_readable_2020)
kaggle_2020['Year'] = 2020
kaggle_2020.head()

In [None]:
columns = ['Time', 'Year', 'Age', 'Gender', 'Country', 'Degree', 'Job Title', 'Company Size', 'Team Size', 
           'ML Status in Company','Compensation Status','Money Spent', 'Current role experience (in years)',
           'Programming language choice', 'Recommend Programming language', 'Coding experience (in years)', 
           'Specialised HW', 'TPU Usage', 'ML Methods experience (in years)']

kaggle_combined = pd.concat([kaggle_2017, kaggle_2018, kaggle_2019, kaggle_2020])
kaggle_combined = kaggle_combined[columns]
kaggle_combined = kaggle_combined.fillna(NOT_AVAILABLE)
print(f'Shape after concatenation: {kaggle_combined.shape}')
kaggle_combined = kaggle_combined.sort_values(['Year', 'Country'])
kaggle_combined = kaggle_combined.reset_index(drop=True)
# Duplicate removal is currently disabled; uncomment the next line to enable it
# kaggle_combined = kaggle_combined.drop_duplicates(keep='first')
kaggle_combined.insert(0, 'Unique_Id', kaggle_combined.index.to_list())
print(f'Dataset shape (including Unique_Id): {kaggle_combined.shape}')
kaggle_combined.head()

In [None]:
# del kaggle_2017, kaggle_2018, kaggle_2019, kaggle_2020
# gc.collect()

### Age

In [None]:
def combine_age_2018(row):
    if type(row) == float:
        if (float(row) >= 70.0):
            return '70+'
        if (float(row) < 18.0):
            return '18-21'
        
    if row == '80+':
        return '70+'
    elif row == '70-79':
        return '70+'
    else:
        return row
    
kaggle_combined['Age'] = kaggle_combined['Age'].apply(combine_age_2018)

In [None]:
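age_ranges = kaggle_combined['Age'].unique()

def combine_age_2017(row):
    # Only the 2017 survey stores raw numeric ages; later years already use range labels
    if row.Year != 2017:
        return row.Age
    try:
        age = int(float(row.Age))
    except (TypeError, ValueError):
        # Already a label (e.g. '70+') or the "not specified" placeholder
        return row.Age
    if age >= 70:
        return '70+'
    for local_age in age_ranges:
        if isinstance(local_age, str) and '-' in local_age:
            ranges = local_age.split('-')
            try:
                if int(ranges[0]) <= age <= int(ranges[1]):
                    return local_age
            except ValueError:
                continue
    # No matching range label found: keep the original value
    return row.Age
    
kaggle_combined['Age'] = kaggle_combined.apply(combine_age_2017, axis=1)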

In [None]:
kaggle_combined['Age'].value_counts()
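
The bucketing of raw 2017 ages into range labels can also be sketched without pandas. The bucket edges below are assumptions modelled on labels such as `18-21` and `70+` that appear in the code above, and the helper name `age_to_bucket` is hypothetical.

```python
from bisect import bisect_right

# Assumed lower bound of each bucket and the matching label (illustrative only).
LOWER_BOUNDS = [18, 22, 25, 30, 35, 40, 45, 50, 55, 60, 70]
LABELS = ['18-21', '22-24', '25-29', '30-34', '35-39', '40-44',
          '45-49', '50-54', '55-59', '60-69', '70+']

def age_to_bucket(age):
    """Map a numeric age to a range label; ages below 18 fall into '18-21'."""
    try:
        age = float(age)
    except (TypeError, ValueError):
        return age  # already a label or a placeholder string
    if age != age:  # NaN never equals itself
        return age
    # Index of the last lower bound that does not exceed the age.
    index = max(bisect_right(LOWER_BOUNDS, age) - 1, 0)
    return LABELS[index]

for sample in [17, 22, 69, 85, '70+']:
    print(sample, '->', age_to_bucket(sample))
```

A sorted list of lower bounds with `bisect` avoids scanning every label per row and makes the handling of ages outside the listed ranges explicit.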

### Gender

In [None]:
def change_gender(row):
    if row['Gender'] == 'Man':
        return 'Male'
    elif row['Gender'] == 'Woman':
        return 'Female'
    elif row['Gender'].strip() == 'A different identity':
        return 'Prefer not to say'
    elif row['Gender'].strip() == 'Non-binary, genderqueer, or gender non-conforming':
        return 'Nonbinary'
    else:
        return row['Gender']
    
kaggle_combined['Gender'] = kaggle_combined.apply(change_gender, axis=1)

In [None]:
kaggle_combined['Gender'].value_counts()

### Degree / qualification

In [None]:
def degree_change(row):
    if row.Degree == 'I did not complete any formal education past high school':
        return 'No formal education past high school'
    elif row.Degree == 'Master\'s degree':
        return 'Master’s degree'
    elif row.Degree == 'Bachelor\'s degree':
        return 'Bachelor’s degree'
    elif row.Degree == 'Some college/university study without earning a bachelor\'s degree':
        return 'Some college/university study without earning a bachelor’s degree'
    else:
        return row.Degree
    
kaggle_combined['Degree'] = kaggle_combined.apply(degree_change, axis=1)
kaggle_combined['Degree'].value_counts()

### Company Size

In [None]:
def change_company_size(row):
    if row.Year == 2019:
        if row['Company Size'] == '> 10,000 employees':
            return '10,000 or more employees'   
        else:
            return row['Company Size']

    if row['Company Size'] == '10 to 19 employees' or row['Company Size'] == 'Fewer than 10 employees':
        return '0-49 employees'
    elif row['Company Size'] == '20 to 99 employees' or row['Company Size'] == '100 to 499 employees':
        return '50-249 employees'
    elif row['Company Size'] == '500 to 999 employees':
        return '250-999 employees'
    elif row['Company Size'] == '1,000 to 4,999 employees' or row['Company Size'] == '5,000 to 9,999 employees':
        return '1000-9,999 employees'
    elif row['Company Size'] == '10,000 or more employees':
        return '10,000 or more employees'
    else:
        return row['Company Size']
    
kaggle_combined['Company Size'] = kaggle_combined.apply(change_company_size, axis=1)
kaggle_combined['Company Size'].value_counts()

### Compensation Status

In [None]:
dict_salary_2018_mapping = {'0-10,000':'0-10,000','10-20,000': '10,001-20,000', '20-30,000': '20,001-30,000', '30-40,000':'30,000-39,999',
                           '40-50,000':'40,000-49,999', '50-60,000':'50,000-59,999', '60-70,000':'60,000-69,999',
                           '70-80,000':'70,000-79,999', '80-90,000':'80,000-89,999', '90-100,000':'90,000-99,999',
                           '100-125,000':'100,000-124,999', '125-150,000':'125,000-149,999', '150-200,000': '150,000-199,999',
                           '200-250,000':'200,000-249,999', '250-300,000': '250,000-299,999', '300-400,000':'300,000-500,000',
                           '400-500,000':'300,000-500,000','500,000+':'> $500,000', 'I do not wish to disclose my approximate yearly compensation':'Cant Disclose',
                           NOT_AVAILABLE: NOT_AVAILABLE, np.nan:np.nan}

def change_salary(row):
    if row.Year == 2019 or row.Year == 2020:
        if row['Compensation Status']=='$0-999' or row['Compensation Status'] == '1,000-1,999' or row['Compensation Status'] == '2,000-2,999' \
            or row['Compensation Status']=='3,000-3,999' or row['Compensation Status']=='4,000-4,999' or row['Compensation Status']=='5,000-7,499' or row['Compensation Status']=='7,500-9,999':
            return '0-10,000'
        elif row['Compensation Status'] == '10,000-14,999' or row['Compensation Status'] == '15,000-19,999':
            return '10,001-20,000'
        elif row['Compensation Status'] == '20,000-24,999' or row['Compensation Status'] == '25,000-29,999':
            return '20,001-30,000'
        else:
            return row['Compensation Status']

    elif row.Year == 2018:
        #if not row['Compensation Status'].isna():
        value_to_return = dict_salary_2018_mapping[row['Compensation Status']]
        return value_to_return
    else:
        return row['Compensation Status']
    
kaggle_combined['Compensation Status'] = kaggle_combined.apply(change_salary, axis=1)

list_values = list(dict_salary_2018_mapping.values())
list_values.remove(np.nan)

def change_salary_2017(row):
    if row['Year'] == 2017:
        for i in list_values:
            ranges = i.split('-')
            if len(ranges)==2:
                try:
                    if int(row['Compensation Status'].replace(',',''))>=int(ranges[0].replace(',','')) and int(row['Compensation Status'].replace(',','')) <= int(ranges[1].replace(',','')):
                        return i
                except:
                    return 'Cant Disclose'
            else:
                try:
                    if int(row['Compensation Status'].replace(',',''))>500000:
                        return '> $500,000'
                    else:
                        return 'Cant Disclose'
                except:
                    return 'Cant Disclose'
                #> 5,000,000, can't disclose
                
    else:
        return row['Compensation Status']

In [None]:
kaggle_combined['Compensation Status'] = kaggle_combined.apply(change_salary_2017, axis=1)
kaggle_combined['Compensation Status'].unique()
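
The range lookup applied to the 2017 amounts can be illustrated on plain strings. The range labels are copied from the target values of `dict_salary_2018_mapping`; the helper names `parse_amount` and `salary_to_bucket` are hypothetical, and the sketch keeps the "first matching range wins" behaviour, so an amount of exactly 30,000 lands in `20,001-30,000`.

```python
# Closed ranges, in the order they are checked.
RANGE_LABELS = [
    '0-10,000', '10,001-20,000', '20,001-30,000', '30,000-39,999',
    '40,000-49,999', '50,000-59,999', '60,000-69,999', '70,000-79,999',
    '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999',
    '150,000-199,999', '200,000-249,999', '250,000-299,999', '300,000-500,000',
]

def parse_amount(text):
    """Turn '1,234' into 1234; return None when the text is not a whole number."""
    try:
        return int(str(text).replace(',', ''))
    except ValueError:
        return None

def salary_to_bucket(text):
    amount = parse_amount(text)
    if amount is None:
        return 'Cant Disclose'
    for label in RANGE_LABELS:
        low, high = (int(part.replace(',', '')) for part in label.split('-'))
        if low <= amount <= high:
            return label  # first matching range wins
    return '> $500,000' if amount > 500000 else 'Cant Disclose'

for sample in ['5,000', '30000', '30,001', '750,000', 'n/a']:
    print(sample, '->', salary_to_bucket(sample))
```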

### Country

In [None]:
consistent_country_name = {
    'United States of America': 'United States', 
    'Republic of China': 'China',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    "People 's Republic of China": 'China', 
    'Republic of Korea': 'South Korea',
    'Hong Kong (S.A.R.)': 'Hong Kong',
    'Iran, Islamic Republic of...': 'Iran',
    'Viet Nam': 'Vietnam'
}
kaggle_combined['Country'] = kaggle_combined['Country'].apply(lambda x: x.strip())
kaggle_combined = kaggle_combined.replace({'Country': consistent_country_name})
kaggle_combined = kaggle_combined.replace({'Country': {'I do not wish to disclose my location': NOT_AVAILABLE}})
kaggle_combined = strip_spaces_from_all_columns(kaggle_combined)

In [None]:
countries = kaggle_combined['Country'].value_counts().index.tolist()
countries

In [None]:
print(f'Dataset shape: {kaggle_combined.shape}')
kaggle_combined.to_csv(f'{PREPROCESSED_DATASET_UPLOAD_FOLDER}/kaggle_2017_to_2020.csv', index=False)

In [None]:
missing_countries_in_2020 = ['New Zealand', 'Denmark', 'Finland', 'Norway', 'UK', 'Czech Republic', 'Hungary',
                            'Austria', 'Algeria']

### Comparing combined dataset with individual tables and combined counts from them

In [None]:
total_rows = kaggle_2017.shape[0] + kaggle_2018.shape[0] + kaggle_2019.shape[0] + kaggle_2020.shape[0] 
print(f"Individual tables: total: {total_rows}, \n  kaggle_2017: {kaggle_2017.shape[0]},   \n  kaggle_2018: {kaggle_2018.shape[0]}, " \
      f"\n  kaggle_2019: {kaggle_2019.shape[0]},  \n  kaggle_2020: {kaggle_2020.shape[0]}")
print(f"Combined table: kaggle_combined: {kaggle_combined.shape[0]}")

### Comparing combined dataset with a reference dataset for differences and discrepancies

In [None]:
total_rows = kaggle_2017.shape[0] + kaggle_2018.shape[0] + kaggle_2019.shape[0] + kaggle_2020.shape[0] 
print(f"Individual tables: total: {total_rows}, \n  kaggle_2017: {kaggle_2017.shape[0]},   \n  kaggle_2018: {kaggle_2018.shape[0]}, " \
      f"\n  kaggle_2019: {kaggle_2019.shape[0]},  \n  kaggle_2020: {kaggle_2020.shape[0]}")
print(f"Combined table: kaggle_combined: {kaggle_combined.shape[0]}")
print(f"Reference table: kaggle_questions_and_responses_ref_df: {kaggle_questions_and_responses_ref_df.shape[0]}")

In [None]:
print(kaggle_questions_and_responses_ref_df.columns)
print(kaggle_combined.columns)

In [None]:
diff_df = dataframe_difference(
    kaggle_questions_and_responses_ref_df,
    kaggle_combined[kaggle_questions_and_responses_ref_df.columns]
)

In [None]:
print(f"Combined table: kaggle_combined: {kaggle_combined.shape[0]}\n")
print(diff_df['_merge'].value_counts(), "\n")
print(f"Differences table: diff_df: {diff_df.shape[0]}\n")
print("Total of merged table differences:", sum(diff_df['_merge'].value_counts()))

The reference table and the newly combined table may differ for good reasons; why they differ can be checked later on.
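The `_merge` column printed above suggests the comparison relies on the `indicator` option of `merge()`. As a minimal sketch of that idea (the helper `rows_by_origin` and the toy rows are hypothetical, not the notebook's own `dataframe_difference`), an outer merge labels each row by where it was found:

```python
import pandas as pd

def rows_by_origin(left: pd.DataFrame, right: pd.DataFrame) -> pd.Series:
    """Count rows found only in `left`, only in `right`, or in both (hypothetical helper)."""
    # With no `on` argument, merge() joins on all columns the two frames share
    merged = left.merge(right, how='outer', indicator=True)
    return merged['_merge'].value_counts()

# Toy rows for illustration only, not taken from the survey data
reference = pd.DataFrame({'Year': [2019, 2020, 2020], 'Country': ['India', 'Brazil', 'Japan']})
combined = pd.DataFrame({'Year': [2019, 2020, 2020], 'Country': ['India', 'Brazil', 'Kenya']})

counts = rows_by_origin(reference, combined)
print(counts)
```

Rows labelled `left_only` or `right_only` are the ones worth inspecting when the two tables disagree.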

### Checking for missing data (by country and/or year)

In [None]:
filter_missing_countries = (kaggle_combined['Country'].isin(missing_countries_in_2020)) & (kaggle_combined['Year'] != 2020)
kaggle_combined[filter_missing_countries]

In [None]:
filter_missing_countries = (kaggle_combined['Country'].isin(missing_countries_in_2020)) & (kaggle_combined['Year'] == 2020)
kaggle_combined[filter_missing_countries]

## Combining Kaggle dataset with Countries and Continents dataset

In [None]:
print(f'Before combining kaggle_combined.shape: {kaggle_combined.shape}')
country_and_continent_info = pd.read_csv('../input/world-bank-data-1960-to-2016-extended/Countries_and_continents_of_the_world.csv')
print(f'   Before dropping duplicates from country_and_continent_info: {country_and_continent_info.shape}')
country_and_continent_info = country_and_continent_info.rename(columns={'Country Name': 'Country'})
country_and_continent_info = country_and_continent_info.drop_duplicates(subset = ['Country'], keep='first')
print(f'   After dropping duplicates from country_and_continent_info: {country_and_continent_info.shape}')
country_and_continent_info[['Region', 'Continent', 'Country']] = \
                                    country_and_continent_info[['Region', 'Continent', 'Country']].fillna(NOT_AVAILABLE)
country_and_continent_info = country_and_continent_info.reset_index(drop=True)
country_and_continent_info = strip_spaces_from_all_columns(country_and_continent_info)

print()
print('    --- country_and_continent_info combined with kaggle_combined into kaggle_combined_country_and_continents ---')
print()

kaggle_combined_country_and_continents = kaggle_combined.merge(country_and_continent_info, how='left', on='Country', indicator=True)
print("We won't be removing duplicates for the reason that they end up\n")
# print(f'   Before dropping duplicates from kaggle_combined_country_and_continents: {kaggle_combined_country_and_continents.shape}')
# kaggle_combined_country_and_continents = kaggle_combined_country_and_continents.drop_duplicates(keep='first')
# print(f'   After dropping duplicates from kaggle_combined_country_and_continents: {kaggle_combined_country_and_continents.shape}')
kaggle_combined_country_and_continents[['Region', 'Continent', 'Country']] = \
                                    kaggle_combined_country_and_continents[['Region', 'Continent', 'Country']].fillna(NOT_AVAILABLE)
kaggle_combined_country_and_continents = kaggle_combined_country_and_continents.reset_index(drop=True)
kaggle_combined_country_and_continents = strip_spaces_from_all_columns(kaggle_combined_country_and_continents)
print(f'After combining kaggle_combined.shape: {kaggle_combined_country_and_continents.shape}')

In [None]:
country_and_continent_info['Region'] = country_and_continent_info['Region'].apply(lambda x: x.strip())

In [None]:
country_and_continent_info['Region'].value_counts()

In [None]:
country_and_continent_info['Continent'].value_counts()

In [None]:
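def exclude_columns_in_df(dataframe: pd.DataFrame, columns_to_exclude: list = None) -> pd.DataFrame:
    # Avoid a mutable default argument; treat None as "exclude nothing"
    columns_to_exclude = set(columns_to_exclude or [])
    # Keep the remaining columns in their original order (a set difference would shuffle them)
    columns_to_keep = [column for column in dataframe.columns if column not in columns_to_exclude]
    return dataframe[columns_to_keep]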

In [None]:
sorted_df1 = exclude_columns_in_df(kaggle_combined, ['_merge']).sort_values(by=['Year', 'Country']) 
sorted_df2 = exclude_columns_in_df(kaggle_combined_country_and_continents[kaggle_combined.columns], ['_merge']).sort_values(by=['Year', 'Country'])
diff_df1 = dataframe_difference(sorted_df1, sorted_df2[sorted_df1.columns])
diff_df2 = dataframe_difference(sorted_df2[sorted_df1.columns], sorted_df1)

In [None]:
print(f'diff_df1: {diff_df1.shape[0]} diff_df2: {diff_df2.shape[0]}')
print(f'kaggle_combined: {kaggle_combined.shape[0]}, ' \
      f'kaggle_combined_country_and_continents: {kaggle_combined_country_and_continents.shape[0]}')

In [None]:
diff_indices = set(kaggle_combined.index) - set(kaggle_combined_country_and_continents.index)
len(diff_indices)

In [None]:
from pandas.testing import assert_frame_equal

try:
    assert_frame_equal(kaggle_combined, kaggle_combined_country_and_continents[kaggle_combined.columns])
except Exception as ex:
    print(ex)

In [None]:
import os
!mkdir -p /kaggle/working/tmp
os.environ['sorted_df1_filename'] = '/kaggle/working/tmp/kaggle_combined.csv'
os.environ['sorted_df2_filename'] = '/kaggle/working/tmp/kaggle_combined_country_and_continents.csv'
sorted_df1_filename = os.environ['sorted_df1_filename']
sorted_df2_filename = os.environ['sorted_df2_filename']
sorted_df1.to_csv(sorted_df1_filename, index=False)
sorted_df2[sorted_df1.columns].to_csv(sorted_df2_filename, index=False)
!head -n 1 $sorted_df1_filename
! echo ""
!head -n 1 $sorted_df2_filename
! echo ""
!diff --suppress-common-lines -y $sorted_df1_filename $sorted_df2_filename
print("Download the two .csv files and compare them using diff (just like the above command)")

Comparing the datasets this way may seem old-school, but it gives a better idea of why they differ. So far the two datasets appear to be semantically the same, even though a literal (line-by-line) comparison shows small differences. Further comparisons and analysis below will show the differences.
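A cell-by-cell comparison in pandas can complement the `diff` output. The sketch below is only an illustration: the helper `count_cell_mismatches` and the toy rows are hypothetical, and it assumes both frames hold the same rows in the same order.

```python
import pandas as pd

def count_cell_mismatches(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.Series:
    """Per-column count of cells that differ; a value missing in both frames counts as equal."""
    a = df_a.reset_index(drop=True)
    b = df_b[df_a.columns].reset_index(drop=True)
    both_missing = a.isna() & b.isna()
    differs = a.ne(b) & ~both_missing
    return differs.sum()

# Toy rows for illustration only: the second country has a trailing space in `after`
before = pd.DataFrame({'Country': ['Finland', 'Norway'], 'Age': ['25-29', None]})
after = pd.DataFrame({'Country': ['Finland', 'Norway '], 'Age': ['25-29', None]})

mismatches = count_cell_mismatches(before, after)
print(mismatches[mismatches > 0])
```

Columns with a non-zero count point to literal differences such as stray whitespace or changed missing-value markers.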

In [None]:
original_index = sorted(kaggle_combined.index.to_list())
index_after_merge = sorted(kaggle_combined_country_and_continents.index.to_list())
print(kaggle_combined.shape[0], kaggle_combined_country_and_continents.shape[0], 
      kaggle_combined.shape[0] - kaggle_combined_country_and_continents.shape[0], 
      len(set(original_index) - set(index_after_merge)))

All the shape (row count) checks report a mismatch, although `merge()` and `concat()` (after dropping duplicates) do not seem to find any differences.
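One way a left merge can change the row count is when the lookup table has repeated keys. The sketch below uses made-up toy rows (not taken from the datasets) to show the effect, and how the `validate` argument of `merge()` can flag it early:

```python
import pandas as pd

# Toy rows for illustration only
responses = pd.DataFrame({'Country': ['Turkey', 'Turkey', 'Finland']})
lookup = pd.DataFrame({'Country': ['Turkey', 'Turkey', 'Finland'],
                       'Continent': ['Asia', 'Europe', 'Europe']})

# Without deduplication, each Turkey response matches two lookup rows
inflated = responses.merge(lookup, how='left', on='Country')
print(len(responses), len(inflated))

# validate='many_to_one' raises if the right-hand keys are not unique
try:
    responses.merge(lookup, how='left', on='Country', validate='many_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError as error:
    duplicate_keys_detected = True
    print(error)

# After dropping duplicate keys, the merge keeps the original row count
deduplicated = lookup.drop_duplicates(subset=['Country'], keep='first')
safe = responses.merge(deduplicated, how='left', on='Country', validate='many_to_one')
print(len(safe))
```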

In [None]:
from fuzzywuzzy import fuzz
FILTER_COLUMN = 'Country'

central_countries_list = list(set(country_and_continent_info[FILTER_COLUMN].values))
print(f'No of countries in the "country_and_continent_info" table: {len(central_countries_list)}')

countries_active_on_kaggle = list(set(kaggle_combined_country_and_continents[FILTER_COLUMN].values))
print(f'No of countries active on Kaggle ("kaggle_combined_country_and_continents" table): {len(countries_active_on_kaggle)}')

countries_active_on_kaggle = list(set(kaggle_combined[FILTER_COLUMN].values))
print(f'No of countries active on Kaggle ("kaggle_combined" table): {len(countries_active_on_kaggle)}')

print()
filter_not_active_on_kaggle = ~country_and_continent_info[FILTER_COLUMN].isin(countries_active_on_kaggle)
countries_not_active_on_kaggle = list(set(country_and_continent_info[filter_not_active_on_kaggle][FILTER_COLUMN].values))
print(f'No of countries NOT active on Kaggle: {len(countries_not_active_on_kaggle)}')

found_pairs = {}
for not_an_active_country in countries_not_active_on_kaggle:
    for active_country in countries_active_on_kaggle:
        ratio = fuzz.token_sort_ratio(not_an_active_country, active_country)
        if ratio >= 70:
            found_pairs.update({not_an_active_country: active_country})

print(f'No of countries that match (between countries_active_on_kaggle and countries_not_active_on_kaggle) due to fuzzy similarity: {len(found_pairs)}')

country_and_continent_info['active_on_kaggle'] = 1
country_and_continent_info.loc[filter_not_active_on_kaggle, 'active_on_kaggle'] = 0
active_filter = country_and_continent_info['active_on_kaggle'] == 1
print()
print(f'No of countries NOT active on Kaggle: {country_and_continent_info[~active_filter].shape[0]}')
print(f'No of countries active on Kaggle: {country_and_continent_info[active_filter].shape[0]}')

In [None]:
country_and_continent_info['Region'].value_counts()

In [None]:
country_and_continent_info['Continent'].value_counts()

In [None]:
display(country_and_continent_info[country_and_continent_info['Country Code'].isna()])
display(country_and_continent_info[['Country', 'Country Code']])

In [None]:
country_and_continent_info.to_csv(f'{PREPROCESSED_DATASET_UPLOAD_FOLDER}/country_and_continent_info.csv', index=False)

### Check for correctness after merging the datasets

In [None]:
countries_check_df = pd.concat([kaggle_combined['Country'].value_counts(), 
                                kaggle_combined_country_and_continents['Country'].value_counts()], axis=1)
countries_check_df.columns=['Before merging count', 'After merging count']
countries_check_df['Difference'] = countries_check_df['After merging count'] - countries_check_df['Before merging count']
filter_differences = countries_check_df['Difference'] != 0
print(f"Sum of all the differences: {abs(countries_check_df['Difference'].sum())}")
countries_check_df[filter_differences]

#### Checks on Turkey data

In [None]:
filter_turkey_kc = (kaggle_combined['Country'] == 'Turkey')
filter_turkey_kccc = (kaggle_combined_country_and_continents['Country'] == 'Turkey')

In [None]:
print(
    "kaggle_combined row count:", kaggle_combined[filter_turkey_kc].shape[0], 
    "\nkaggle_combined_country_and_continents row count:", kaggle_combined_country_and_continents[filter_turkey_kccc][kaggle_combined.columns].shape[0]
)
print()
print("With limited (as that of kaggle_combined) columns (after dropping duplicates):", kaggle_combined_country_and_continents.loc[filter_turkey_kccc, kaggle_combined.columns].drop_duplicates(keep='first').shape)
print("With all columns (after dropping duplicates):", kaggle_combined_country_and_continents.loc[filter_turkey_kccc].drop_duplicates(keep='first').shape)

From the above we can say, more or less, that the new data looks good. The mismatch in row counts arises only when we try to drop duplicates (especially when we restrict the columns to those of the `kaggle_combined` dataset).
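A likely reason, stated here as an assumption, is that the survey rows carry no respondent identifier, so two people who gave identical answers look like duplicates. A minimal sketch with made-up rows and a hypothetical `survey_columns` list:

```python
import pandas as pd

# Toy rows for illustration only: the first two respondents gave identical answers
merged = pd.DataFrame({
    'Year': [2019, 2019, 2020],
    'Country': ['Turkey', 'Turkey', 'Turkey'],
    'Age': ['25-29', '25-29', '30-34'],
    '_merge': ['both', 'both', 'both'],
})
survey_columns = ['Year', 'Country', 'Age']

# keep=False marks every member of a duplicate group, not just the repeats
duplicate_rows = merged[merged.duplicated(subset=survey_columns, keep=False)]
rows_after_drop = merged.drop_duplicates(subset=survey_columns, keep='first').shape[0]

print(f'rows: {merged.shape[0]}, in duplicate groups: {duplicate_rows.shape[0]}, after drop: {rows_after_drop}')
```

Dropping such rows would discard genuine responses, which is a reason to compare counts without deduplicating.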

#### Checks on Hong Kong data

In [None]:
filter_extra_HK_2020_data = (kaggle_combined['Country'] == 'Hong Kong') & (kaggle_combined['Year'] != 2020) 
print('Before merging countries')
print('   - Original count:', kaggle_combined[filter_extra_HK_2020_data].shape)
print('   - Duplicates dropped count:', kaggle_combined[filter_extra_HK_2020_data].drop_duplicates().shape)
print('After merging countries')
filter_extra_HK_2020_data = (kaggle_combined_country_and_continents['Country'] == 'Hong Kong') & (kaggle_combined_country_and_continents['Year'] != 2020) 
print('   - Duplicates dropped count:', kaggle_combined_country_and_continents[filter_extra_HK_2020_data].shape)

In [None]:
filter_HK_2020_data = (kaggle_combined['Country'] == 'Hong Kong') & (kaggle_combined['Year'] == 2020) 
kaggle_combined[filter_HK_2020_data]

#### We can confirm that there is no data for **Hong Kong** in **2020**. Also there are no duplicates present.

### Let's check Denmark, Canada, HK or another country

In [None]:
country_to_check = 'Finland'
filter_country_c = kaggle_combined['Country'] == country_to_check
filter_country_cc = kaggle_combined_country_and_continents['Country'] == country_to_check

In [None]:
a = kaggle_combined[filter_country_c].sort_index().reset_index(drop=True)
b = kaggle_combined_country_and_continents[filter_country_cc][kaggle_combined.columns].sort_index().reset_index(drop=True)
display(dataframe_difference(a, b, 'both'))  # display explicitly: only a cell's last expression is shown automatically
print(a['Year'].value_counts(), b['Year'].value_counts())

In [None]:
kaggle_combined_country_and_continents.to_csv(
    f'{PREPROCESSED_DATASET_UPLOAD_FOLDER}/kaggle_2017_to_2020_and_countries.csv', index=False
)

## Survey response vs total Kaggle members stats

In [None]:
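def get_users_count_upto_end_of(year: int) -> int:
    # Parse the MM/DD/YYYY (US format) strings into dates first: comparing them as plain
    # strings is lexicographic, so e.g. '01/15/2021' would sort before '12/31/2017'
    register_dates = pd.to_datetime(kaggle_users['RegisterDate'], format='%m/%d/%Y')
    return int((register_dates.dt.year <= year).sum())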

In [None]:
%%time
force_generate = True
if force_generate:
    kaggle_users = pd.read_csv('../input/meta-kaggle/Users.csv')

In [None]:
%%time
kaggle_members_count_stats = {
    'Year': ['2020', '2019', '2018', '2017'], 
    'Total_members_count': [get_users_count_upto_end_of(2020), get_users_count_upto_end_of(2019), 
              get_users_count_upto_end_of(2018), get_users_count_upto_end_of(2017)]
}

kaggle_members_count_stats

In [None]:
del kaggle_users
gc.collect()

In [None]:
def get_response_count_for(year: int) -> int:
    filter_year = preprocessed_kaggle_combined['Year'] == year
    return preprocessed_kaggle_combined[filter_year].shape[0]

In [None]:
preprocessed_kaggle_combined = pd.read_csv('../input/kaggle-machine-learning-data-science-survey-ext/preprocessed-kaggle-2017-to-2020/kaggle_2017_to_2020.csv')
kaggle_response_count_stats = {
    'Year': ['2020', '2019', '2018', '2017'], 
    'Total_responses_count': [get_response_count_for(2020), get_response_count_for(2019), 
              get_response_count_for(2018), get_response_count_for(2017)]
}

kaggle_response_count_stats

In [None]:
survey_response_stats_df = pd.DataFrame(kaggle_response_count_stats).merge(
    pd.DataFrame(kaggle_members_count_stats), 
    how ='inner', on='Year'
)
survey_response_stats_df = survey_response_stats_df.sort_values(by='Year').reset_index(drop=True)
survey_response_stats_df['Members_to_response_ratio'] = survey_response_stats_df['Total_responses_count'] / survey_response_stats_df['Total_members_count']

# A list comprehension keeps the column order deterministic (a set would not)
columns = [column for column in survey_response_stats_df.columns if column != 'Year']

for each_column in columns:
    if 'pct_change' not in each_column:
        new_column = f'{each_column}_pct_change'
        survey_response_stats_df[new_column] = survey_response_stats_df[each_column].pct_change()
survey_response_stats_df

In [None]:
survey_response_stats_df.to_csv(f'{PREPROCESSED_DATASET_UPLOAD_FOLDER}/survey_response_stats.csv', index=False)

### Zipping the respective directories

In [None]:
%%bash
UPLOAD_FOLDER="/kaggle/working/upload"
PREPROCESSED_UPLOAD_FOLDER='/kaggle/working/upload/preprocessed-kaggle-2017-to-2020'

echo "~~~ Zipping folder: ${PREPROCESSED_UPLOAD_FOLDER}"
cd "${PREPROCESSED_UPLOAD_FOLDER}"
zip -r9 "${PREPROCESSED_UPLOAD_FOLDER}.zip" *
echo "~~~ ${PREPROCESSED_UPLOAD_FOLDER}.zip ready to be used/copied."
cd "${UPLOAD_FOLDER}"
ls -lash *.zip 
echo ""

## Uploading newly created/updated csv to your Kaggle Dataset

Set up your local environment with your Kaggle login details (`KAGGLE_KEY` and `KAGGLE_USERNAME`).

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

import os
os.environ['KAGGLE_KEY'] = user_secrets.get_secret("KAGGLE_KEY")
os.environ['KAGGLE_USERNAME'] = user_secrets.get_secret("KAGGLE_USERNAME")

Using the `kaggle` Python client, log in to your account from within the kernel.

In [None]:
import kaggle
kaggle.api.authenticate()

Get the metadata for the dataset you have already created manually. It is best to create the dataset manually and upload the initial csv file(s) into it, to avoid subsequent issues with updating the dataset (as seen during my own end-to-end cycle).

Before saving the metadata as a JSON file, add or update two keys, `id` and `id_no`, with the respective details as shown below.

In [None]:
OWNER_SLUG='neomatrix369'
DATASET_SLUG='kaggle-machine-learning-data-science-survey-ext'
dataset_metadata = kaggle.api.metadata_get(OWNER_SLUG, DATASET_SLUG)
dataset_metadata['id'] = dataset_metadata["ownerUser"] + "/" + dataset_metadata['datasetSlug']
dataset_metadata['id_no'] = dataset_metadata['datasetId']
import json
with open(f'{DATASET_UPLOAD_FOLDER}/dataset-metadata.json', 'w') as file:
    json.dump(dataset_metadata, file, indent=4)

Finally, call the `dataset_create_version()` API and pass it the folder that contains the metadata file together with the .csv and .fth file(s) you would like to upload to your existing Dataset (as a new version).

In [None]:
%%time
# !kaggle datasets version -m "Updating datasets" -p /kaggle/working/upload
kaggle.api.dataset_create_version(DATASET_UPLOAD_FOLDER, 'Updating datasets')

### Cleanup 

In [None]:
!rm -fr /kaggle/working/upload

### Thanks / Credits

- [sahilmaheshwari](https://www.kaggle.com/sahilmaheshwari/) (https://www.kaggle.com/thedatabeast/cleaned-mcr-kaggle-survey-2019)
- [rblcoder](https://www.kaggle.com/rblcoder/) (https://www.kaggle.com/aitzaz/stack-overflow-developer-survey-2020)
- [harveenchadha](https://www.kaggle.com/harveenchadha) (https://www.kaggle.com/harveenchadha/kaggle-survey-20172020-merged-data) - special thanks for your data preparatory kernel (https://www.kaggle.com/harveenchadha/merging-all-historical-survey-data-2017-2020), I have reused a number of things from it

For sharing a number of additional survey/country/region/world related datasets.

### Prequels/sequels

- **Kaggle Machine Learning & Data Science (data-prep)** | [Extended Dataset](https://www.kaggle.com/neomatrix369/kaggle-machine-learning-data-science-survey-ext) | [Additional Dataset](https://www.kaggle.com/neomatrix369/world-bank-data-1960-to-2016-extended)
- [Kaggle Global Outreach (analysis)](https://www.kaggle.com/neomatrix369/kaggle-global-outreach-analysis/)