# Kaggle 2020 Survey on State of ML and Data Science
## Latinas on ML and data science
### By: Myrna M Figueroa Lopez      

**Purpose**: show how to access the pdf and csv data, explore the competition data, use different visualization techniques, and discuss the responses of women residing in Latin America who participated in the Kaggle 2020 survey. 

**Limitations**: This presentation is limited to the competition data and to the opinions of those who participated. While I may use the term *latina*, it may not reflect the identity of the respondent: some may reside in Latin America and not be latina, and latinas may also live in other nations, whether or not those are listed in the survey. No numeric conclusion here has been through a proper quantitative analysis; each is based on my own qualitative observations. Also, I limited the visibility of some dataframes to save space.  

**Observation of general survey results**: Participants of the Kaggle 2020 survey were mostly young (25-29) male students from India with Master's degrees and about 3-5 years of experience in coding. According to the survey, participants use Python the most, followed by SQL, R, and C++, and most recommend Python to beginners.      

Most participants identified Jupyter as the integrated development environment (IDE) they use most, and Colab as the hosted notebook product they use most, with Kaggle second. Matplotlib is the most used data visualization library, and regression is the most popular machine learning algorithm among participants. Interestingly, the majority of participants stated that they have never used a TPU (tensor processing unit).    

Respondents described their place of work or business as one with 1 or 2 people responsible for the data science workload. Most identified *analyzing and understanding data* as their most important work role. Most respondents identified Coursera as the platform where they began learning data science, and MySQL as the big data product they use the most. Most respondents publicly share or deploy their data analysis or machine learning applications on GitHub, followed by not sharing at all, and then on Kaggle. 

**Micro-results**: As in the general group, Python, SQL, and R are the most popular programming languages among women coders residing in Latin America. They also recommend Python to beginners and identified Jupyter as their most-used integrated development environment (IDE). They identified Colab as the notebook they use the most, followed by none, and then Kaggle notebooks. The majority of these women also stated that they have no experience with TPUs, and chose Matplotlib as the library they use the most. 

Female coders residing in Latin America use the Scikit-learn machine learning framework the most. Most of these women said that their employer does not incorporate machine learning in their place of work. Like the group at large, they identified the *analysis and understanding of data* as their main workplace role. Among this group of women, Google Cloud Platform, Amazon Web Services (AWS), and Microsoft Azure are the cloud computing platforms of choice, in that order. Similar to the overall participants, these women share their projects on GitHub the most. They identified Coursera as the platform where they first encountered data science, followed by Udemy and then Kaggle. Women coders living in Latin America would like to learn MySQL in the near future. 

**Conclusion**: The main reason for the PDF-related lines of code was to demonstrate ways to access PDF files when presented with them. I used the overall survey data to explore the overall opinion of participants. I focused on a group of people (women residing in Latin America) to observe and compare their answers to those of the general group. I found no significant difference in their responses, other than the years of experience with coding. Future opportunities include exploring angles that I did not explore here. A future quantitative analysis and a peer review of this notebook would also be ideal.
 

In [None]:
# Python 3 environment as defined by the kaggle/python Docker 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O 

#Obtain competition data
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


### **Accessing the data: PDFs in Python**

In [None]:
#First, install python library for working with PDFs
!pip install PyPDF2

In [None]:
#Second, import the modules
import PyPDF2

#Then, extract text from PDF

#Start by creating a pdf file object for one of the sources
pdfFileObj = open('../input/kaggle-survey-2020/supplementary_data/kaggle_survey_2020_methodology.pdf', 'rb') 

# Then, create a pdf reader object
# (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# Check the number of pages in the file
print(len(pdfReader.pages))

# create a page object for the first page
pageObj = pdfReader.pages[0]

# extract text from page
print(pageObj.extract_text())
  
# End by closing the pdf file object 
pdfFileObj.close() 

In [None]:
#Repeat for the other PDF source

pdfFileObj2 = open('../input/kaggle-survey-2020/supplementary_data/kaggle_survey_2020_answer_choices.pdf', 'rb') 
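# creating a pdf reader object
# (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
pdfReader2 = PyPDF2.PdfReader(pdfFileObj2)

# number of pages in pdf
print(len(pdfReader2.pages))

# creating a page object from the second reader
# (the original read from pdfReader, whose file was already closed)
pageObj2 = pdfReader2.pages[0]

# extracting text from page
print(pageObj2.extract_text())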
  
# closing the pdf file object 
pdfFileObj2.close() 

The PDF files may contain more data for exploration (one file has 3 pages and the other 20). However, the focus of analysis for this notebook is the data in the CSV file. The main idea here was to demonstrate ways to access PDF files when presented with them.

### **Accessing the data: CSV file**

In [None]:
#First, open CSV file as a pandas dataframe
df = pd.read_csv('../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv')
#visualize the first 3 rows of the dataframe
df.head(3)

The dataframe above is large, with empty entries (NaNs) in certain rows and columns.   
20036 participants answered several questions related to data science, preferences, experience, and demographics.
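
A quick way to gauge how sparse the responses are is to compute the share of missing values per column. The sketch below uses a small made-up frame (`toy` and its column names are assumptions, not survey data); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey dataframe (not real survey data)
toy = pd.DataFrame({
    "Python": ["Python", "Python", np.nan, "Python"],
    "R": [np.nan, "R", np.nan, np.nan],
    "Swift": [np.nan, np.nan, np.nan, np.nan],
})

# Number of rows (respondents) and columns (answer choices)
n_rows, n_cols = toy.shape

# Fraction of missing entries per column, largest first
nan_share = toy.isna().mean().sort_values(ascending=False)
print(n_rows, n_cols)
print(nan_share)
```

Columns with a share close to 1 were selected by almost nobody, which is expected for multiple-choice questions where every option gets its own column.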

### Explore the data: Macro-level observations
To make the dataframe easier to follow, rename the columns with the entries in row 0.   
Then, remove row 0 to avoid confusion.   

In [None]:
#rename column with data in row 0
df=df.rename(columns=df.iloc[0])

#remove the 1st row
df = df.iloc[1:]
#numeric description of the survey answers
df.describe()
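
As an alternative to renaming and then dropping row 0, pandas can take the question text as the header while reading the file. This is a minimal sketch on an in-memory CSV; `csv_text` and its contents are made up for illustration, and it assumes the file has the short codes on the first line and the question text on the second.

```python
import io
import pandas as pd

# Hypothetical two-header CSV: short codes first, question text second
csv_text = """Q1,Q2
What is your age (# years)?,What is your gender? - Selected Choice
25-29,Man
30-34,Woman
"""

# header=1 uses the second line as column names and skips the line before it
toy = pd.read_csv(io.StringIO(csv_text), header=1)
print(toy.columns.tolist())
print(len(toy))
```

The trade-off is that the short question codes are discarded at load time, so this only suits an analysis that never refers to them.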

Kaggle 2020 survey participants were mostly young (25-29) male students from India with Master's degrees and about 3-5 years of experience in coding.   
The data in the csv consists of several columns not easily seen.    
I split the dataframe into several smaller ones to analyze the survey answers.
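
Statements such as "mostly 25-29" come from reading `describe()` on text columns, which reports `count`, `unique`, `top`, and `freq` rather than means. A minimal sketch on made-up answers (the `toy` frame is an assumption, not survey data):

```python
import pandas as pd

# Hypothetical single-choice column with one missing answer
toy = pd.DataFrame({"Age": ["25-29", "25-29", "30-34", None, "18-21"]})

summary = toy.describe()
# count  = number of non-missing answers
# unique = number of distinct answers
# top    = most frequent answer
# freq   = how many times the top answer occurs
print(summary)

# The full distribution behind 'top' and 'freq'
print(toy["Age"].value_counts())
```

`top` only names the single most frequent answer, so it says nothing about how close the runner-up is; `value_counts()` shows that.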

In [None]:
Demographics=df.iloc[:,:-348] 
questions=df.iloc[:,7:356]

Above, I split the dataframe into demographics and survey questions.   
Below, I separate the questions dataframe into 28 smaller dataframes, each representing a group of related questions for easier interpretation.
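
Hard-coded column positions are easy to get wrong by one. A hedged alternative is to group columns by the question text they share. The sketch below assumes multiple-choice columns follow the pattern `<question> - Selected Choice - <option>`; the column names in `toy` are invented for illustration.

```python
import pandas as pd

# Hypothetical column names following the assumed naming pattern
toy = pd.DataFrame(columns=[
    "Which languages do you use? - Selected Choice - Python",
    "Which languages do you use? - Selected Choice - R",
    "Which IDEs do you use? - Selected Choice - Jupyter",
    "Which IDEs do you use? - Selected Choice - RStudio",
    "What is your age?",
])

sep = " - Selected Choice - "  # assumed separator between question and option

# Collect the columns that belong to each question
groups = {}
for col in toy.columns:
    stem = col.split(sep)[0]
    groups.setdefault(stem, []).append(col)

# One smaller dataframe per question
frames = {stem: toy[cols] for stem, cols in groups.items()}
print({stem: len(cols) for stem, cols in groups.items()})
```

Columns that do not contain the separator, such as single-choice questions, end up in a group of their own.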

In [None]:
Languages=questions.iloc[:,0:13]
PrgL=Languages.describe()
#renaming columns for ease of visualization
PrgL.columns=["Python","R","SQL",'C','C++','Java', 'Javascript','Julia', 'Swift', 'Bash', 'MATLAB', 'None', 'Other' ]
PrgL

In [None]:
#check the datatype of the dataframe
PrgL.dtypes

In [None]:
#select the 'count' row (a pandas Series)
Programs=PrgL.iloc[0]

#convert to dataframe
Programs=pd.DataFrame(Programs)
Programs

In [None]:
#change datatype from object to integer

Programs["count"] = Programs["count"].astype(str).astype(int)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt #plotting, math, stats
%matplotlib inline

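#DataFrame.plot creates its own figure, so the size is passed here
#instead of through a separate plt.figure() call
Programs.plot(kind='bar', color="teal", figsize=(10,7))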

On a regular basis, survey participants use Python the most, followed by SQL, R, and C++. 
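
The same ranking can be obtained without going through `describe()` and a dtype conversion, by counting the non-missing cells in each option column. A minimal sketch on made-up data (the `toy` frame is an assumption; on the real data it would be the language columns):

```python
import numpy as np
import pandas as pd

# Hypothetical multiple-choice block: one column per option,
# the option name where selected and NaN where not
toy = pd.DataFrame({
    "Python": ["Python", "Python", "Python", np.nan],
    "R": [np.nan, "R", np.nan, np.nan],
    "SQL": ["SQL", np.nan, "SQL", np.nan],
})

# Count selections per option and rank them, most selected first
counts = toy.notna().sum().sort_values(ascending=False)
print(counts)
```

Because `notna().sum()` already returns integers, the result can be plotted directly.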

In [None]:
Env=questions.iloc[:,13:26]

#visualize the description of the dataframe using the code:
## Env.describe()

Most respondents recommend Python to coding beginners.   
Also, most identified Jupyter as the integrated development environment (IDE) they use most. 

In [None]:
Notebook=questions.iloc[:,26:40]
#visualize the description of the dataframe using the code:
## Notebook.describe()

Respondents identified Colab as the hosted notebook product they use most. Kaggle was the second most used. 

In [None]:
gpus=questions.iloc[:,40:45]
#visualize the description of the dataframe using the code:
## gpus.describe()

Most participants use a pc or laptop for their data science projects. Those that use specialized hardware mostly choose GPUs.

In [None]:
libraries=questions.iloc[:,45:58]
libraries.describe()

The majority of participants stated that they have never used a TPU (tensor processing unit). Also, most use Matplotlib as the data visualization library in their projects.

In [None]:
MLs=questions.iloc[:,58:75]
MLs.describe()

Most respondents have less than 1 year of experience using machine learning (ML) methods. Scikit-learn is the most used ML framework. 

In [None]:
algs=questions.iloc[:,75:87]
algs.describe()

Respondents chose Linear or Logistic Regression as the most used ML algorithm. 

In [None]:
vision=questions.iloc[:,87:94]
#visualize the description of the dataframe using the code:
## vision.describe()

Most respondents identified image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc.) as the computer vision methods they use regularly. 

In [None]:
NLP=questions.iloc[:,94:100]
#visualize the description of the dataframe using the code:
## NLP.describe()

Respondents identified word embeddings/vectors (GLoVe, fastText, word2vec) as the natural language processing (NLP) method they use most.

In [None]:
Work=questions.iloc[:,101:111]
Work.describe()

Respondents described their place of work or business as one with 1 or 2 people responsible for data science workloads, where most are considering applying ML methods to their work. Most identified analyzing and understanding data to influence business decisions as the most important part of their role at work.

In [None]:
Income=questions.iloc[:,111:113]
#visualize the description of the dataframe using the code:
## Income.describe()

In [None]:
cloud=questions.iloc[:,113:125]
cloud.describe()

Survey participants stated that the cloud computing platform they use most is Amazon Web Services (AWS), followed by Google Cloud Platform (GCP).

In [None]:
clo=questions.iloc[:,125:137]
#visualize the description of the dataframe using the code:
## clo.describe()

m=questions.iloc[:,137:148]
#visualize the description of the dataframe using the code:
## m.describe()

In [None]:
big=questions.iloc[:,148:167]
big.describe()

Most identified MySQL as the big data product (relational database, data warehouse, data lake, or similar) they use most.

In [None]:
bi=questions.iloc[:,167:183]
#visualize the description of the dataframe using the code:
## bi.describe()

Most participants stated that they do not use business intelligence tools regularly.

In [None]:
automated=questions.iloc[:,183:203]
#visualize the description of the dataframe using the code:
## automated.describe()

tool=questions.iloc[:,203:214]
#visualize the description of the dataframe using the code:
## tool.describe()

In [None]:
public=questions.iloc[:,214:224]
p1=public.describe()
p1

Most respondents do not publicly share or deploy their data analysis or machine learning applications. Of those that do, most do it on GitHub, and then Kaggle.

In [None]:
platforms=questions.iloc[:,224:236]
platforms.describe()

Most respondents identified Coursera as the platform where they began or completed data science courses.   
The cells below split out other answers to the survey.

In [None]:
media=questions.iloc[:,236:249]
#visualize the description of the dataframe using the code:
## media.describe()

In [None]:
futurePlat=questions.iloc[:,249:261]
#visualize the description of the dataframe using the code:
## futurePlat.describe()

In [None]:
futureCl=questions.iloc[:,261:273]
#visualize the description of the dataframe using the code:
## futureCl.describe()

In [None]:
futureML=questions.iloc[:,273:284]
#visualize the description of the dataframe using the code:
## futureML.describe()

In [None]:
futureBig=questions.iloc[:,284:302]
futureBig.describe()

In [None]:
futureB=questions.iloc[:,302:317]
#visualize the description of the dataframe using the code:
## futureB.describe()

futureT=questions.iloc[:,317:337]
#visualize the description of the dataframe using the code:
## futureT.describe()

futureManaging=questions.iloc[:,337:356]
#visualize the description of the dataframe using the code:
## futureManaging.describe()

### Demographics

In [None]:
#dropping the 1st column
Demo=Demographics.drop('Duration (in seconds)', axis=1)

In [None]:
#finding the unique answer choices in a specific column
Demo['What is your gender? - Selected Choice'].unique()

In [None]:
Demo['For how many years have you been writing code and/or programming?'].unique()

In [None]:
Demo['Select the title most similar to your current role (or most recent title if retired): - Selected Choice'].unique()

In [None]:
Demo['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?'].unique()

In [None]:
#visualize gender labels in survey
import seaborn as sns
import matplotlib.pyplot as plt #plotting, math, stats
%matplotlib inline

plt.figure(figsize=(10,5))
#pass the column as x= (newer seaborn versions treat a positional argument as data)
sns.countplot(x=Demo['What is your gender? - Selected Choice'])

In [None]:
#Nationality
##the specific labels for countries related to the survey
Demo['In which country do you currently reside?'].unique()

In [None]:
#visualizing the frequency in which people
#from those nations participated in the survey
plt.figure(figsize=(12,7))
plt.xticks(rotation=90) 
#pass the column as x= (newer seaborn versions treat a positional argument as data)
sns.countplot(x=Demo['In which country do you currently reside?'])

In [None]:
#getting the count of participants per each country listed
Demo['In which country do you currently reside?'].value_counts()

In [None]:
#Identifying the Number Of Missing Data (NaNs)
Demo.isnull().sum()

## Female Participants Residing in Latin America

In [None]:
#Make a new dataframe with those who reside in Latin American countries
subdf = df[df['In which country do you currently reside?'].isin(['Chile', 'Brazil', 'Mexico', 'Colombia', 'Peru', 'Argentina'])]
subdf.describe()

Most of the Kaggle 2020 survey participants residing in Latin America are young men living in Brazil with 3-5 years of coding experience.

### Female respondents who reside in Latin America

In [None]:
ladies=subdf[subdf['What is your gender? - Selected Choice'].isin(['Woman'])]
ladies.describe()

190 women who live in Latin American countries participated in the Kaggle 2020 survey. Most reside in Brazil, have a Master's degree, but less than a year of experience coding.
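
To see how two demographic answers combine, for example country and coding experience, a cross-tabulation is more informative than `describe()`. The sketch below uses invented rows; `toy`, its column names, and its values are assumptions, not the survey data.

```python
import pandas as pd

# Hypothetical respondents (made up for illustration)
toy = pd.DataFrame({
    "Country": ["Brazil", "Brazil", "Mexico", "Chile", "Brazil", "Mexico"],
    "Experience": ["< 1 years", "1-2 years", "< 1 years",
                   "3-5 years", "< 1 years", "1-2 years"],
})

# Rows are countries, columns are experience levels, cells are respondent counts
table = pd.crosstab(toy["Country"], toy["Experience"])
print(table)
```

On the real data the same call would take the country and experience columns of the dataframe holding the women who reside in Latin America.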

In [None]:
#create a df using the ladies demographic
Demgrph=ladies.iloc[:,1:-348] 
#renaming columns 
Demgrph.columns=["Age group","Gender identity","Country of residence",'Education','Title','Experience' ]
Demgrph.head(3)

In [None]:
#identify if NaNs are present in any column
Demgrph.isnull().sum()

In [None]:
Demgrph['Experience'].unique()

In [None]:
#create a series of value count of female respondents per country
s=Demgrph['Country of residence'].value_counts(dropna=False)
s

In [None]:
# Create a pie chart

s.plot.pie(label="", title="Country", figsize=(5, 5)); 
plt.show(block=True);           


In [None]:
#turning a series into a dataframe
totals = pd.DataFrame(s) 
totals

In [None]:
#Create a df with country coordinates 
#to visualize in a map

data = {'Country of residence':  ['Argentina', 'Brazil','Colombia', 'Chile', 'Mexico', 'Peru'],
        'Latitude': ['-38.4193', '-23.533773','4.624335','-33.447487','19.432608','-12.046374'],
         'Longitude': ['-63.5989', '-46.625290','-74.063644','-70.673676','-99.133209','-77.042793']
        }

coord = pd.DataFrame (data, columns = ['Country of residence','Latitude','Longitude'])
coord

In [None]:
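# Look up each country's total in the value counts (s) by country name,
# so that every total lands on the matching row of coord
# (a hand-typed list in value_counts order would not match coord's row order)
coord['Total ladies'] = coord['Country of residence'].map(s).astype(str)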
  
# Observe the result 
coord

In [None]:
#importing library for plotting maps
import folium

In [None]:
#Latin America
#using Colombia as central point
LatinAmerica = folium.Map(location=[4.624335, -74.063644],
                   zoom_start = 3)

In [None]:
#On a map, show the total number of respondents who
#identify as women and live in Latin America

#using a loop to get coordinates
for index, row in coord.iterrows(): #using the coord dataframe 
    if row['Country of residence']!=0: #to avoid an error      
        folium.Marker([row['Latitude'], row['Longitude']], popup=row['Total ladies']).add_to(LatinAmerica)
LatinAmerica

### How did these ladies respond to the survey?

In [None]:
#make a dataframe of only the responses to the survey
answers=ladies.iloc[:,7:356]

#then isolate the specific questions
Languages=answers.iloc[:,0:13]
Langs=Languages.describe()

#renaming columns for ease of visualization
Langs.columns=["Python","R","SQL",'C','C++','Java', 'Javascript','Julia', 'Swift', 'Bash', 'MATLAB', 'None', 'Other' ]

In [None]:
#select the 'count' row (a pandas Series)
Langs=Langs.iloc[0]

#convert to dataframe
Langs=pd.DataFrame(Langs)


In [None]:
#plot the new dataframe
#DataFrame.plot creates its own figure, so the size is passed here
Langs.plot(kind='bar', color="pink", figsize=(10,7))

Their response does not differ significantly from the entire survey participant pool. Python, SQL and R are the most popular programming languages among women coders residing in Latin America.

In [None]:

Recommended=answers.iloc[:,13:14]
#renaming columns for ease of visualization
Recommended.columns=["Recommended programming language"]
rec=Recommended.describe()
rec

In [None]:
Recommended.value_counts()

They also recommend Python to beginners the most.
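
Whether a subgroup answers a single-choice question like this one differently from everyone else can be checked with a chi-square test of independence, which is one possible form of the quantitative analysis this notebook leaves for later. The sketch below computes the statistic by hand on made-up counts; the numbers and the names `observed` and `expected` are assumptions, not survey results.

```python
import numpy as np
import pandas as pd

# Hypothetical counts for a single-choice question (made up).
# The two rows must be disjoint groups: the subgroup and everyone else.
observed = pd.DataFrame(
    {"Python": [160, 42], "R": [20, 9], "SQL": [20, 9]},
    index=["Everyone else", "Subgroup"],
)

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.values.sum()

# Expected counts if the answer were independent of the group:
# row total * column total / grand total
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / grand_total,
    index=observed.index,
    columns=observed.columns,
)

# Chi-square statistic and its degrees of freedom
chi2 = ((observed - expected) ** 2 / expected).values.sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(round(chi2, 3), dof)
```

With 2 degrees of freedom, a statistic below 5.991 means the difference is not significant at the 5% level. The test assumes each respondent is counted once, so it does not carry over directly to "select all that apply" questions, and it becomes unreliable when expected counts are very small.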

In [None]:
Environment=answers.iloc[:,14:26]
#renaming columns for ease of visualization
Environment.columns=["Jupyter","RStudio",'Visual Studio','Visual Studio Code','PyCharm', 'Spyder','Notepad ++', 'Sublime Text', 'Vim / Emacs', 'MATLAB', 'None', 'Other' ]
E=Environment.describe()
E

In [None]:
#plotting a single row
row = E.iloc[0]
row.plot(kind='bar')

They also identified Jupyter as their most-used integrated development environment (IDE).

In [None]:
Notebook=answers.iloc[:,26:40]
#renaming columns 
Notebook.columns=["Kaggle","Colab",'Azure','Paperspace / Gradient','Binder / JupyterHub', 'Code Ocean','IBM Watson Studio', 'Amazon Sagemaker Studio', 'Amazon EMR', 'Google Cloud AI Platform', 'Google Cloud Datalab', 'Databricks Collaborative','None', 'Other' ]
nb=Notebook.describe()
nb

In [None]:
#plotting a single row
row = nb.iloc[0]
row.plot(kind='barh', color='mediumorchid')

Women residing in Latin America identified Colab as the notebook they use the most, followed by none, and in third place is Kaggle notebook.

In [None]:
libraries=answers.iloc[:,45:58]
#renaming columns 
libraries.columns=["TPU experience","Matplotlib",'Seaborn','Plotly / Plotly Express','Ggplot / ggplot2', 'Shiny','D3 js', 'Altair', 'Bokeh', 'Geoplotlib', 'Leaflet / Folium', 'None', 'Other' ]
lb=libraries.describe()
lb

Most of them also have no experience with TPUs, and they chose Matplotlib as the library they use the most.

In [None]:
MLexperience=answers.iloc[:,58:59]
MLexperience.describe()

In [None]:
#Machine learning frameworks
MLs=answers.iloc[:,59:75]
#renaming columns 
MLs.columns=["Scikit-learn","TensorFlow",'Keras','PyTorch','Fast.ai', 'MXNet','Xgboost', 'LightGBM', 'CatBoost', 'Prophet', 'H2O 3', 'Caret', 'Tidymodels','JAX','None', 'Other' ]
ML=MLs.describe()
ML

In [None]:
#plotting a single row
row = ML.iloc[0]
row.plot(kind='line', color='peru',figsize=(11,5))

Female coders residing in Latin America use the Scikit-learn machine learning framework the most.

In [None]:
algorithms=answers.iloc[:,75:87]
#renaming columns 
algorithms.columns=["Regression","Decision Trees or Random Forests",'Gradient Boosting Machines','Bayesian Approaches','Evolutionary Approaches', 'Dense Neural Networks','Convolutional Neural Networks', 'Generative Adversarial Networks', 'Recurrent Neural Networks', 'Transformer Networks','None', 'Other' ]
a=algorithms.describe()
a

In [None]:
#plotting a single row
row = a.iloc[0]

row.plot(kind='line', color='darkgreen',figsize=(11,5))
plt.xticks(rotation = 45)
plt.show()

In [None]:
Job=answers.iloc[:,101:111]
#renaming columns 
Job.columns=["data science work staff","employer uses machine learning",'Top role: Analyze and understand data','Top role: Build and/or run the data infrastructure','Top role: Build prototypes for applying machine learning', 'Top role: Build or run an ML service','Top role: to improve existing ML models', 'Top role: research to advances ML', 'None', 'Other' ]
jobs=Job.describe()
jobs

Most of these women said that their employer does not incorporate machine learning in their place of work. Like the group-at-large, they identified the analysis and understanding of data as their main role in their workplace.

In [None]:
cloud=answers.iloc[:,113:125]
#renaming columns 
cloud.columns=["Amazon Web Services (AWS)","Microsoft Azure",'Google Cloud Platform (GCP)','IBM Cloud / Red Hat','Oracle Cloud', 'SAP Cloud','Salesforce Cloud', 'VMware Cloud', 'Alibaba Cloud',"Tencent",'None', 'Other' ]
cl=cloud.describe()

In [None]:
cl.fillna(0)

In [None]:
row = cl.iloc[0]
row.plot(kind='pie',figsize=(7,7), title="Cloud choice")

Among women coders residing in Latin America, Google Cloud Platform, Amazon Web Services (AWS), and Microsoft Azure are the cloud computing platforms of choice, in that order.

In [None]:
public=answers.iloc[:,214:224]
p2=public.describe()
p2

Unlike the overall participant pool, most of these women prefer to share their projects, and they do so on GitHub the most. Kaggle was the 4th most chosen medium for sharing their work publicly.
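
Because the two groups differ greatly in size (20036 respondents overall versus 190 women residing in Latin America), raw counts are hard to compare side by side. One option is to convert each group's counts to a percentage of that group before plotting. In this sketch the group sizes come from the notebook, while the per-platform counts in `counts` are made up for illustration.

```python
import pandas as pd

# Group sizes reported in this notebook
group_sizes = pd.Series({"General": 20036, "Latinas": 190})

# Hypothetical selection counts per sharing platform (made up)
counts = pd.DataFrame(
    {"General": [4000, 2000, 1000], "Latinas": [38, 19, 19]},
    index=["GitHub", "Kaggle", "I do not share"],
)

# Divide each column by its group size to get the percentage selecting each option
share = counts.div(group_sizes, axis="columns") * 100
print(share.round(1))
```

A bar chart of `share` puts both groups on the same 0-100 scale, so the smaller group is no longer flattened against the axis.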

In [None]:
#get rows from each dataframe for comparison. 
row1 = p1.iloc[0]
row2= p2.iloc[0]

#combine these rows into a new dataframe
compare = [row1, row2]
diff = pd.DataFrame(
    {'General': row1,
     'Latinas': row2     
    })
ax = diff.plot.bar(color=("green","red"))

In [None]:
#swap rows and columns
diff1=diff.transpose()
#rename columns
diff1.columns=["Plotly Dash",'Streamlit','NBViewer', 'GitHub','Personal blog', 'Kaggle', 'Colab', 'Shiny', 'I dont share', 'Other' ]
diff1

In [None]:
ax = diff1.plot.bar(figsize=(10,8), title="Sharing platforms")

In [None]:
platforms1=answers.iloc[:,224:236]
platforms1.describe()

These women identified Coursera as the platform where they first learned coding/ML, followed by Udemy and then Kaggle.

In [None]:
#Big data product they wish to learn in the near future
BigData=answers.iloc[:,284:302]
#rename columns
BigData.columns=['MySQL','PostgreSQL',"SQLite",'Oracle Database','MongoDB',"Snowflake","IBM Db2","Microsoft SQL Server","Microsoft Access",'Microsoft Azure Data Lake Storage','Amazon Redshift', 'Amazon Athena','Amazon DynamoDB', 'Google Cloud BigQuery', 'Google Cloud SQL', 'Google Cloud Firestore', 'None', 'Other' ]
BD=BigData.describe()

In [None]:
#convert NaNs to zeros (assign the result; fillna does not modify BD in place)
BD = BD.fillna(0)
BD

In [None]:
#single row as a series: total count for each choice
bd=BD.iloc[0]
#convert object datatype to int (assign the result so the conversion is kept)
bd = bd.astype('int32')
bd

In [None]:
#pie plot
# Creating color parameters 
colors = ( "rosybrown", "chocolate", "salmon", 
          "grey", "thistle", "beige",'violet',"palevioletred", "red", "darkred", 
         "peru","pink","hotpink","peachpuff","orange",
         "darkmagenta","orchid","darkgoldenrod") 

bd.plot(kind='pie',figsize=(9,9), title="Would like to learn..", autopct='%1.1f%%', colors=colors)

Women coders living in Latin America would like to learn MySQL.