# Kaggle survey 2021: Is Spain different?
Every year Kaggle runs a survey to gauge the state of data science and machine learning around the world. Here we collect the main figures from the ["2021 Kaggle Machine Learning & Data Science Survey"](https://www.kaggle.com/c/kaggle-survey-2021) in order to compare the situation in Spain with the rest of the world. The charts on the left show the data for Spain and those on the right show the rest of the world. The survey collected a total of 25,973 responses, of which 454 (1.75%) come from Spain.

## Summary
In Spain:
* the proportion of women is even lower than in the rest of the world
* it is more common to enter data science after a master's degree, whereas in the rest of the world people enter in equal proportion from a bachelor's and from a master's degree
* regarding age, the highest peak corresponds to the 45-49 band, in clear contrast to the trend in the rest of the world, where this band ranks 7th
* people with more than 20 years of programming experience form the highest peak, in clear contrast to the rest of the world, where the dominant group is people who have been programming for 1 to 3 years

The demographic data for Spain are therefore very different from those of the rest of the world. Is Spain different? YES

In [None]:
import numpy as np 
import pandas as pd
import operator

import seaborn as sns
import matplotlib.pyplot as plt

# color scheme
sns.set_style("darkgrid")
#sns.set(rc={'axes.facecolor':'cornflowerblue', 'figure.facecolor':'orange'})
sns.set(rc={'axes.facecolor':'#f4cb0b', 'figure.facecolor':'#f8f285'})
Spain_color = 'Red'
ROW_color = 'Grey'

kaggle_survey_2021 = pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv", low_memory=False)

# Translate the answers into Spanish (the plot labels are shown in Spanish)
# Global replace
kaggle_survey_2021 = kaggle_survey_2021.replace({'Other': 'Otro'}, regex=True)
kaggle_survey_2021 = kaggle_survey_2021.replace({'None' : 'Ninguno'}, regex=True)

# individual questions
kaggle_survey_2021["Q2"] = kaggle_survey_2021["Q2"].str.replace('Man','Hombre')
kaggle_survey_2021["Q2"] = kaggle_survey_2021["Q2"].str.replace('Woman','Mujer')
order_Q2 = ['Hombre','Mujer']

kaggle_survey_2021["Q4"] = kaggle_survey_2021["Q4"].str.replace('No formal education past high school','Sin estudios')
kaggle_survey_2021["Q4"] = kaggle_survey_2021["Q4"].str.replace('Bachelor’s degree','Grado')
kaggle_survey_2021["Q4"] = kaggle_survey_2021["Q4"].str.replace('Master’s degree','Máster')
kaggle_survey_2021["Q4"] = kaggle_survey_2021["Q4"].str.replace('Doctoral degree','Doctorado')
#order = ["No formal education past high school","Bachelor’s degree","Master’s degree","Doctoral degree"]
order_Q4 = ['Sin estudios','Grado','Máster','Doctorado']

kaggle_survey_2021["Q5"] = kaggle_survey_2021["Q5"].str.replace('Student','Estudiante')
kaggle_survey_2021["Q5"] = kaggle_survey_2021["Q5"].str.replace('Currently not employed','Sin empleo')


kaggle_survey_2021["Q6"] = kaggle_survey_2021["Q6"].str.replace('years','años')
kaggle_survey_2021["Q6"] = kaggle_survey_2021["Q6"].str.replace('I have never written code','Nunca')
order_Q6 = ['Nunca',
         '< 1 años',
         '1-3 años',
         '3-5 años',
         '5-10 años',
         '10-20 años',
         '20+ años']

kaggle_survey_2021["Q15"] = kaggle_survey_2021["Q15"].str.replace('I do not use machine learning methods','No uso ML')
kaggle_survey_2021["Q15"] = kaggle_survey_2021["Q15"].str.replace('Under 1 year','< 1 año')
kaggle_survey_2021["Q15"] = kaggle_survey_2021["Q15"].str.replace('20 or more years','> 20 años')
kaggle_survey_2021["Q15"] = kaggle_survey_2021["Q15"].str.replace('years','años')
# include the '4-5 años' band so that it is not dropped from the plot
order_Q15 = ['No uso ML',
         '< 1 año',
         '1-2 años',
         '2-3 años',
         '3-4 años',
         '4-5 años',
         '5-10 años',
         '10-20 años',
         '> 20 años']

# Raw strings (r'...') keep the escaped parentheses intact for the regex engine
kaggle_survey_2021["Q18_Part_1"] = kaggle_survey_2021["Q18_Part_1"].str.replace(r'General purpose image/video tools \(PIL, cv2, skimage, etc\)',
                                                                                'Herramientas generales', regex=True)
kaggle_survey_2021["Q18_Part_2"] = kaggle_survey_2021["Q18_Part_2"].str.replace(r'Image segmentation methods \(U-Net, Mask R-CNN, etc\)',
                                                                                'Segmentación', regex=True)
kaggle_survey_2021["Q18_Part_3"] = kaggle_survey_2021["Q18_Part_3"].str.replace(r'Object detection methods \(YOLOv3, RetinaNet, etc\)',
                                                                                'Detección de objetos', regex=True)
kaggle_survey_2021["Q18_Part_4"] = kaggle_survey_2021["Q18_Part_4"].str.replace(r'Image classification and other general purpose networks \(VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc\)',
                                                                                'Clasificación de imágenes', regex=True)
kaggle_survey_2021["Q18_Part_5"] = kaggle_survey_2021["Q18_Part_5"].str.replace(r'Generative Networks \(GAN, VAE, etc\)','Redes generativas', regex=True)


kaggle_survey_2021["Q19_Part_1"] = kaggle_survey_2021["Q19_Part_1"].str.replace(r'Word embeddings/vectors \(GLoVe, fastText, word2vec\)',
                                                                                'Word embeddings/vectors', regex=True)
kaggle_survey_2021["Q19_Part_2"] = kaggle_survey_2021["Q19_Part_2"].str.replace(r'Encoder-decorder models \(seq2seq, vanilla transformers\)',
                                                                                'modelos Encoder-decorder', regex=True)
kaggle_survey_2021["Q19_Part_3"] = kaggle_survey_2021["Q19_Part_3"].str.replace(r'Contextualized embeddings \(ELMo, CoVe\)',
                                                                                'Contextualized embeddings', regex=True)
kaggle_survey_2021["Q19_Part_4"] = kaggle_survey_2021["Q19_Part_4"].str.replace(r'Transformer language models \(GPT-3, BERT, XLnet, etc\)',
                                                                                'modelos Transformer',  regex=True)
    
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Computers/Technology',
                                                                  'Ordenadores/Tecnología',regex=True) 
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Academics/Education',
                                                                  'Academia/Educación', regex=True) 
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Accounting/Finance',
                                                                  'Contabilidad/Finanzas', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Manufacturing/Fabrication',
                                                                  'Fabricación', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Medical/Pharmaceutical',
                                                                  'Medica/Farmacéutico', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Government/Public Service',
                                                                  'Estatal', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Online Service/Internet-based Services',
                                                                  'Online/Internet', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Energy/Mining',
                                                                  'Energía/Minería', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Retail/Sales',
                                                                  'Ventas', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Insurance/Risk Assessment',
                                                                  'Seguros/Asesoramiento de riesgo', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Marketing/CRM',
                                                                  'Marketing', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Broadcasting/Communications',
                                                                  'Media/Comunicaciones', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Shipping/Transportation',
                                                                  'Transportación', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Non-profit/Service',
                                                                  'NGO', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Online Business/Internet-based Sales',
                                                                  'Ventas online', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Military/Security/Defense',
                                                                  'Defensa/Seguridad', regex=True)
kaggle_survey_2021["Q20"] = kaggle_survey_2021["Q20"].str.replace('Hospitality/Entertainment/Sports',
                                                                  'Deportes/Ocio', regex=True)

kaggle_survey_2021["Q21"] = kaggle_survey_2021["Q21"].str.replace('10,000 or more employees','> 10,000 empleados')
kaggle_survey_2021["Q21"] = kaggle_survey_2021["Q21"].str.replace('employees','empleados')
order_Q21 = ['0-49 empleados',
         '50-249 empleados',
         '250-999 empleados',
         '1000-9,999 empleados',
         '> 10,000 empleados']

# Raw strings (r'...') keep the escaped parentheses intact for the regex engine
kaggle_survey_2021["Q23"] = kaggle_survey_2021["Q23"].str.replace(r'We are exploring ML methods \(and may one day put a model into production\)',
                                                                  'No, pero lo estamos considerando', regex=True)
kaggle_survey_2021["Q23"] = kaggle_survey_2021["Q23"].str.replace(r'No \(we do not use ML methods\)',
                                                                  'No usamos ML', regex=True)
kaggle_survey_2021["Q23"] = kaggle_survey_2021["Q23"].str.replace('I do not know',
                                                                  'No sé', regex=True)
kaggle_survey_2021["Q23"] = kaggle_survey_2021["Q23"].str.replace(r'We have well established ML methods \(i.e., models in production for more than 2 years\)',
                                                                  'Si (> de 2 años)', regex=True)
kaggle_survey_2021["Q23"] = kaggle_survey_2021["Q23"].str.replace(r'We recently started using ML methods \(i.e., models in production for less than 2 years\)',
                                                                  'Estamos empezando (< de 2 años)', regex=True)
kaggle_survey_2021["Q23"] = kaggle_survey_2021["Q23"].str.replace(r'We use ML methods for generating insights \(but do not put working models into production\)',
                                                                  'Lo usamos solo para generar insights', regex=True)

kaggle_survey_2021["Q24_Part_1"] = kaggle_survey_2021["Q24_Part_1"].str.replace('Analyze and understand data to influence product or business decisions',
                                                                                'Análisis para tomar decisiones')
kaggle_survey_2021["Q24_Part_2"] = kaggle_survey_2021["Q24_Part_2"].str.replace('Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data',
                                                                                'Construir y/o mantener la infraestructura de datos', regex=True)
kaggle_survey_2021["Q24_Part_3"] = kaggle_survey_2021["Q24_Part_3"].str.replace('Build prototypes to explore applying machine learning to new areas',
                                                                                'Construir prototipos', regex=True)
kaggle_survey_2021["Q24_Part_4"] = kaggle_survey_2021["Q24_Part_4"].str.replace('Build and/or run a machine learning service that operationally improves my product or workflows',
                                                                                'Construir y/o mantener un servicio de ML', regex=True)
kaggle_survey_2021["Q24_Part_5"] = kaggle_survey_2021["Q24_Part_5"].str.replace('Experimentation and iteration to improve existing ML models',
                                                                                'Experimentación con, y mejora de modelos ML ya existentes', regex=True)
kaggle_survey_2021["Q24_Part_6"] = kaggle_survey_2021["Q24_Part_6"].str.replace('Do research that advances the state of the art of machine learning',
                                                                                'Investigación/estado del arte de ML', regex=True)
kaggle_survey_2021["Q24_Part_7"] = kaggle_survey_2021["Q24_Part_7"].str.replace('Ninguno of these activities are an important part of my role at work',
                                                                                'Ninguno de estos actividades', regex=True)
                                                                                

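# '$' and parentheses are regex metacharacters (an unescaped '$' is an end-of-string anchor),
# so this label is matched literally with regex=False
kaggle_survey_2021["Q26"] = kaggle_survey_2021["Q26"].str.replace('$100,000 or more ($USD)', '> $100,000', regex=False)
order_26 = ['$0 ($USD)','$1-$99','$100-$999','$1000-$9,999', '$10,000-$99,999', '> $100,000']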

# Raw strings (r'...') keep the escaped parentheses intact for the regex engine
kaggle_survey_2021["Q36_A_Part_1"] = kaggle_survey_2021["Q36_A_Part_1"].str.replace(r'Automated data augmentation \(e.g. imgaug, albumentations\)',
                                                                                    'Automated data augmentation', regex=True)
kaggle_survey_2021["Q36_A_Part_2"] = kaggle_survey_2021["Q36_A_Part_2"].str.replace(r'Automated feature engineering/selection \(e.g. tpot, boruta_py\)',
                                                                                    'Automated feature engineering/selection', regex=True)
kaggle_survey_2021["Q36_A_Part_3"] = kaggle_survey_2021["Q36_A_Part_3"].str.replace(r'Automated model selection \(e.g. auto-sklearn, xcessiv\)',
                                                                                    'Automated model selection',regex=True)
kaggle_survey_2021["Q36_A_Part_4"] = kaggle_survey_2021["Q36_A_Part_4"].str.replace(r'Automated model architecture searches \(e.g. darts, enas\)',
                                                                                    'Automated model architecture searches',regex=True)
kaggle_survey_2021["Q36_A_Part_5"] = kaggle_survey_2021["Q36_A_Part_5"].str.replace(r'Automated hyperparameter tuning \(e.g. hyperopt, ray.tune, Vizier\)',
                                                                                    'Automated hyperparameter tuning',regex=True)
kaggle_survey_2021["Q36_A_Part_6"] = kaggle_survey_2021["Q36_A_Part_6"].str.replace(r'Automation of full ML pipelines \(e.g. Google AutoML, H2O Driverless AI\)',
                                                                                    'Automation of full ML pipelines',regex=True)


kaggle_survey_2021["Q40_Part_9"] = kaggle_survey_2021["Q40_Part_9"].str.replace(r'Cloud-certification programs \(direct from AWS, Azure, GCP, or similar\)',
                                                                                  'Cloud-certification programs', regex=True)

kaggle_survey_2021["Q40_Part_10"] = kaggle_survey_2021["Q40_Part_10"].str.replace(r'University Courses \(resulting in a university degree\)',
                                                                                  'Asignaturas universitarios', regex=True)

kaggle_survey_2021["Q41"] = kaggle_survey_2021["Q41"].str.replace(r'Basic statistical software \(Microsoft Excel, Google Sheets, etc.\)',
                                                                  'Programas de estadística básica', regex=True)
kaggle_survey_2021["Q41"] = kaggle_survey_2021["Q41"].str.replace(r'Local development environments \(RStudio, JupyterLab, etc.\)',
                                                                  'Entornos de desarrollo local', regex=True)
kaggle_survey_2021["Q41"] = kaggle_survey_2021["Q41"].str.replace(r'Business intelligence software \(Salesforce, Tableau, Spotfire, etc.\)',
                                                                  'Business intelligence software', regex=True)
kaggle_survey_2021["Q41"] = kaggle_survey_2021["Q41"].str.replace(r'Cloud-based data software & APIs \(AWS, GCP, Azure, etc.\)',
                                                                  'Cloud-based data software & APIs', regex=True)
kaggle_survey_2021["Q41"] = kaggle_survey_2021["Q41"].str.replace(r'Advanced statistical software \(SPSS, SAS, etc.\)',
                                                                  'Programas de estadística avanzada', regex=True)

# split
kaggle_Spain = kaggle_survey_2021.query("Q3 == 'Spain'").reset_index(drop = True)
kaggle_ROW   = kaggle_survey_2021.query("Q3 != 'Spain'").reset_index(drop = True)
#kaggle_ROW   = kaggle_survey_2021.query("Q3 == 'France'").reset_index(drop = True)

#  delete the questions in row 0
kaggle_ROW   = kaggle_ROW.drop(0)
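
Several of the answer labels translated above contain `$` and parentheses, which are regex metacharacters. The following is a minimal sketch, on a toy `Series` invented for illustration (not survey data), of how such labels behave with `str.replace` and of two ways to match them literally: `regex=False`, or `re.escape` when `regex=True` is kept.

```python
import re
import pandas as pd

# Toy labels shaped like the survey answers (illustrative only).
answers = pd.Series([
    "$100,000 or more ($USD)",
    "$1-$99",
    "Generative Networks (GAN, VAE, etc)",
])

# As a regex, the unescaped "$" before "USD" is an end-of-string anchor,
# so this pattern cannot match and the label is left unchanged.
broken = answers.str.replace(r"\$100,000 or more \($USD\)", "> $100,000", regex=True)

# Option 1: treat the pattern as plain text.
literal = answers.str.replace("$100,000 or more ($USD)", "> $100,000", regex=False)

# Option 2: keep regex=True but let re.escape quote every metacharacter.
escaped = answers.str.replace(
    re.escape("Generative Networks (GAN, VAE, etc)"), "Redes generativas", regex=True
)

print(broken[0])
print(literal[0])
print(escaped[2])
```

Literal matching is usually the simpler choice when the whole label is known in advance; a regex is only needed when the pattern has to generalise over several labels.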

# <center style="background-color:Gainsboro; width:20%;">Age</center>
We start with the ages of the respondents:

In [None]:
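# Take the age bands from the whole survey (skipping the question text in row 0)
# so that a band missing from the Spanish subset still appears in the right-hand plot
order = sorted(kaggle_survey_2021["Q1"].iloc[1:].dropna().unique())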

plt.figure(figsize=(15, 5))
plt.suptitle('Edad', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x="Q1", data=kaggle_Spain, order=order, color=Spain_color)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x="Q1", data=kaggle_ROW, order=order, color=ROW_color)
plt.xlabel("")
plt.show()

In Spain two peaks are clearly visible: one at 25-29 years, which probably corresponds to recent graduates of the new university degrees that include "data science", "big data", etc., and another at 45-49 years, corresponding to '*senior*' people. In the rest of the world, by contrast, the large majority of respondents are under 30, with a particularly large number of young people aged 18-21.
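
Because the Spanish subset is far smaller than the rest of the world, raw counts are hard to compare across the two panels. A minimal sketch of how the age bands could be compared as percentages instead; the helper `age_shares` and both toy frames are hypothetical and the numbers are invented for illustration, not taken from the survey.

```python
import pandas as pd

def age_shares(df, column="Q1"):
    """Percentage of respondents in each category of `column`."""
    shares = df[column].value_counts(normalize=True).sort_index() * 100
    return shares.round(1)

# Toy data (illustrative only): five "Spanish" and ten "rest of world" answers.
toy_spain = pd.DataFrame({"Q1": ["25-29", "25-29", "45-49", "45-49", "18-21"]})
toy_row = pd.DataFrame({"Q1": ["18-21"] * 6 + ["25-29"] * 3 + ["45-49"]})

# Align both groups on the same age bands; a band absent from one group gets 0.
comparison = pd.DataFrame({
    "Spain_pct": age_shares(toy_spain),
    "ROW_pct": age_shares(toy_row),
}).fillna(0.0)

print(comparison)
```

With the real data, `kaggle_Spain` and `kaggle_ROW` would take the place of the toy frames.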
# <center style="background-color:Gainsboro; width:40%;">Gender</center>

In [None]:
plt.figure(figsize=(15, 5))
plt.suptitle('Género', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
ax1 = sns.countplot(x="Q2", data=kaggle_Spain, order=order_Q2, color=Spain_color)
plt.tick_params(axis='x', rotation=10)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
ax2 = sns.countplot(x="Q2", data=kaggle_ROW, order=order_Q2, color=ROW_color)
plt.tick_params(axis='x', rotation=10)
plt.xlabel("")
for ax in (ax1, ax2):
    for p in ax.patches:
        # print the count as an integer, centred horizontally inside each bar
        ax.annotate(f'\n{int(p.get_height())}', (p.get_x() + p.get_width() / 2, p.get_height()), ha='center', va='top', color='white', size=14)
plt.show()

The imbalance between women and men is very evident: it is already large worldwide (1:4.2) and even worse in Spain, with a ratio of 1:6.8.
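
A ratio of this kind can be read directly off the value counts of the gender column. A minimal sketch, assuming the translated labels `Hombre` and `Mujer` used in this notebook; the helper `men_per_woman` is hypothetical and the toy answers are invented, so the printed ratio is not a survey result.

```python
import pandas as pd

def men_per_woman(gender):
    """Number of 'Hombre' answers for every 'Mujer' answer."""
    counts = gender.value_counts()
    return counts.get("Hombre", 0) / counts["Mujer"]

# Toy answers (illustrative only): 14 men, 4 women, 1 other.
toy_gender = pd.Series(["Hombre"] * 14 + ["Mujer"] * 4 + ["Otro"])

ratio = men_per_woman(toy_gender)
print(f"1:{ratio:.1f}")
```

On the real data the call would be `men_per_woman(kaggle_Spain["Q2"])` and `men_per_woman(kaggle_ROW["Q2"])`.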
# <center style="background-color:Gainsboro; width:40%;">Education level</center>

In [None]:
plt.figure(figsize=(15, 5))
plt.suptitle('Nivel educativo', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x="Q4", data=kaggle_Spain, order=order_Q4, color=Spain_color)
plt.tick_params(axis='x', rotation=10)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x="Q4", data=kaggle_ROW, order=order_Q4, color=ROW_color)
plt.tick_params(axis='x', rotation=10)
plt.xlabel("")
plt.show()

It is interesting that in Spain most of the people currently involved in data science have completed a master's degree, whereas in the rest of the world people already enter the field from a bachelor's degree.

# <center style="background-color:Gainsboro; width:60%;">Job title</center>

In [None]:
plt.figure(figsize=(15, 5))
plt.suptitle('Descripción del puesto de trabajo', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x="Q5", data=kaggle_Spain, color=Spain_color, order=kaggle_Spain['Q5'].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x="Q5", data=kaggle_ROW,  color=ROW_color, order = kaggle_ROW['Q5'].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

It is interesting that in the rest of the world the number of students is almost double that of professionals, whereas in Spain students rank third.

# <center style="background-color:Gainsboro; width:80%;">How many years have you been programming?</center>

In [None]:
plt.figure(figsize=(15, 5))
plt.suptitle('¿Cuantos años llevas programando?', fontsize=20)
plt.subplot(1, 2, 1)
sns.countplot(x="Q6", data=kaggle_Spain,order=order_Q6, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.countplot(x="Q6", data=kaggle_ROW,order=order_Q6, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

In Spain there is a strong presence of "veteran" programmers: the highest peak corresponds to the *senior* group, with more than 20 years of programming experience.
# <center style="background-color:Gainsboro; width:60%;">Programming languages</center>

In [None]:
n_parts = 13
first_col = 7

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split()[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Lenguajes de programación más usados', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

Both in Spain and in the rest of the world the preferred language is [Python](https://www.python.org/), followed by SQL. In Spain the statistical language [R](https://www.r-project.org/about.html) is also very popular.
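
The counting step used for this multiple-choice question (one column per option, a non-null cell meaning "selected") recurs in the sections that follow. A minimal sketch of how it could be factored into a single helper; `count_multiselect`, its parameters and the toy frame are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

def count_multiselect(df, columns, labels):
    """Count non-null answers per option column and sort from most to least used."""
    counts = df[columns].count()   # count() ignores NaN/None, i.e. unselected options
    counts.index = labels          # replace column names with readable option labels
    return (
        counts.sort_values(ascending=False)
        .rename_axis("option")
        .reset_index(name="count")
    )

# Toy frame (illustrative only): three respondents, three option columns.
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", None],
    "Q7_Part_2": [None, "R", None],
    "Q7_Part_3": ["SQL", "SQL", "SQL"],
})

result = count_multiselect(toy, ["Q7_Part_1", "Q7_Part_2", "Q7_Part_3"], ["Python", "R", "SQL"])
print(result)
```

The returned frame has the two columns that `sns.barplot` needs, so each question would reduce to one call per group followed by the plotting code.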
# <center style="background-color:Gainsboro; width:80%;">Which integrated development environments (IDEs) do you use regularly?</center>

In [None]:
n_parts = 13
first_col = 21

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split('-')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('IDE usada con frecuencia', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

Suffice it to say that on Kaggle the most popular integrated development environment (IDE) is [Jupyter Notebook](https://jupyter.org/), which is part of the Kaggle environment. The second most popular IDE is Microsoft's [Visual Studio Code](https://code.visualstudio.com/). Third place worldwide goes to [PyCharm](https://www.jetbrains.com/pycharm/), whereas [RStudio](https://www.rstudio.com/) is much more prominent in Spain.
# <center style="background-color:Gainsboro; width:80%;">Data visualization libraries</center>

In [None]:
first_col = 59
n_parts = 12

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split('-')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Librerías de visualización de datos', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

When it comes to plotting results, the most popular library is [matplotlib](https://matplotlib.org/) for Python, followed by [seaborn](https://seaborn.pydata.org/) (built on matplotlib). In Spain the next library is [ggplot2](https://ggplot2.tidyverse.org/), part of the tidyverse packages for R.
# <center style="background-color:Gainsboro; width:40%;">ML: how many years?</center>

In [None]:
plt.figure(figsize=(15, 5))
plt.suptitle('¿Cuántos años llevas con ML?', fontsize=20)
plt.subplot(1, 2, 1)
sns.countplot(x="Q15", data=kaggle_Spain,order=order_Q15, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.countplot(x="Q15", data=kaggle_ROW,order=order_Q15, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

It is interesting that many people on Kaggle are new to machine learning. This suggests that the *senior* peak in Spain corresponds to people who have gone through 'reskilling'.
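
One way to probe the reskilling reading is to cross-tabulate programming experience (Q6) against machine learning experience (Q15) and look at how the most experienced programmers are distributed. A minimal sketch on toy answers invented for illustration (not survey data); the variable names and the choice of which bands count as "recent" are assumptions.

```python
import pandas as pd

# Toy answers (illustrative only), using the translated labels from this notebook.
toy = pd.DataFrame({
    "Q6":  ["20+ años", "20+ años", "20+ años", "1-3 años", "1-3 años"],
    "Q15": ["1-2 años", "< 1 año", "10-20 años", "< 1 año", "1-2 años"],
})

# Rows: years programming; columns: years using ML; cells: number of respondents.
table = pd.crosstab(toy["Q6"], toy["Q15"])

# Share of veteran programmers with at most two years of ML experience.
veterans = table.loc["20+ años"]
recent_bands = ["< 1 año", "1-2 años"]
share_recent = veterans[recent_bands].sum() / veterans.sum()

print(table)
print(f"{share_recent:.0%} of the veteran programmers are recent ML users")
```

A high share on the real `kaggle_Spain` frame would be consistent with the reskilling interpretation, although it would not prove it.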
# <center style="background-color:Gainsboro; width:40%;">ML/DL Frameworks</center>
Now we look at the machine learning (ML) and deep learning (DL) frameworks.

In [None]:
first_col = 72
n_parts = 18

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('ML/DL Frameworks', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

The machine learning routines of [scikit-learn](https://scikit-learn.org/stable/index.html) are the most popular, both in Spain and in the rest of the world, followed by the deep learning libraries [TensorFlow](https://www.tensorflow.org/), [Keras](https://keras.io/) and [PyTorch](https://pytorch.org/) for neural networks.
For gradient boosting the most popular choice is [XGBoost](https://xgboost.ai/), ahead of [LightGBM](https://github.com/Microsoft/LightGBM) and [CatBoost](https://catboost.ai/).
# <center style="background-color:Gainsboro; width:40%;">ML algorithms</center>

In [None]:
first_col = 90
n_parts = 12

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Algoritmos', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

The most basic algorithms, linear regression and logistic regression (the latter for classification problems), are the most popular, followed by decision trees and then by gradient boosting algorithms. Neural networks occupy fourth and fifth place.
# <center style="background-color:Gainsboro; width:60%;">Computer vision techniques</center>

In [None]:
first_col = 102
n_parts = 7

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Visión artificial', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

* General-purpose tools: [PIL (pillow)](https://python-pillow.org/), [cv2 (OpenCV)](https://github.com/opencv/opencv-python), [skimage (scikit-image)](https://scikit-image.org/docs/stable/api/skimage.html), *etc.*
* Segmentation: [U-Net](https://arxiv.org/pdf/1505.04597v1.pdf), [Mask R-CNN](https://arxiv.org/pdf/1703.06870.pdf), *etc.*
* Object detection: [YOLOv3](https://github.com/ultralytics/yolov3), [RetinaNet](https://arxiv.org/pdf/1708.02002v2.pdf), *etc.*
* Image classification: [VGG](https://keras.io/api/applications/vgg/), [Inception](https://keras.io/api/applications/inceptionv3/), [ResNet](https://github.com/pskrunner14/resnet-classifier), [ResNeXt](https://arxiv.org/pdf/1611.05431v2.pdf), [NASNet](https://arxiv.org/pdf/1912.03151v1.pdf), [EfficientNet](https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html), *etc.*
* Generative networks: [GAN](https://es.wikipedia.org/wiki/Red_generativa_antag%C3%B3nica), [VAE](https://en.wikipedia.org/wiki/Variational_autoencoder), *etc.*

Here we see that the most common application of computer vision is image classification.

# <center style="background-color:Gainsboro; width:80%;">Natural language processing (NLP) frameworks</center>

In [None]:
first_col = 109
n_parts = 6

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('NLP Frameworks', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

* Word embeddings/vectors: [GLoVe](https://nlp.stanford.edu/projects/glove/), [fastText](https://fasttext.cc/), [word2vec](https://en.wikipedia.org/wiki/Word2vec), *etc.*
* Encoder-decoder models: [seq2seq](https://google.github.io/seq2seq/), vanilla transformers, *etc.*
* Contextualized embeddings: [ELMo](https://github.com/HIT-SCIR/ELMoForManyLangs), CoVe
* Transformer language models: [GPT-3](https://es.wikipedia.org/wiki/GPT-3), [BERT](https://arxiv.org/pdf/1810.04805.pdf), [XLNet](https://github.com/zihangdai/xlnet), *etc.*

We see that there is hardly any difference between Spain and the rest of the world in NLP applications.
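
The multi-select questions in this notebook are all tallied the same way: each answer option has its own column, a non-null cell means the option was ticked, and the option label is taken from the question-text row of the CSV. A minimal sketch of that pattern as a reusable function follows. The helper name `count_multiselect`, the column names and the toy data are assumptions made for illustration; they are not part of the survey or of the original notebook.

```python
import pandas as pd

def count_multiselect(responses, question_row, first_col, n_parts):
    """Count non-null answers per option column, most selected first."""
    cols = range(first_col, first_col + n_parts)
    # The label is the text after the last ' -' of the question text.
    labels = [question_row.iloc[n].split(' -')[-1].strip() for n in cols]
    # .count() ignores missing values, so it counts respondents who ticked the option.
    counts = [int(responses.iloc[:, n].count()) for n in cols]
    df = pd.DataFrame({'option': labels, 'count': counts})
    return df.sort_values('count', ascending=False, kind='stable').reset_index(drop=True)

# Toy data (hypothetical): the first row holds the question text, as in the survey CSV.
toy = pd.DataFrame({
    'QX_Part_1': ['Which methods do you use? - Selected Choice - Word embeddings',
                  'Word embeddings', None, 'Word embeddings'],
    'QX_Part_2': ['Which methods do you use? - Selected Choice - Transformers',
                  None, None, 'Transformers'],
})
question_row = toy.iloc[0]
answers = toy.iloc[1:]

result = count_multiselect(answers, question_row, first_col=0, n_parts=2)
print(result)
```

With a helper like this, each cell would only need to supply `first_col`, `n_parts` and a plot title, once for `kaggle_Spain` and once for `kaggle_ROW`.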

# <center style="background-color:Gainsboro; width:60%;">In which industry do you work?</center>

In [None]:
question_num = 'Q20'
plt.figure(figsize=(15, 5))
plt.suptitle('Industria', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x=question_num, data=kaggle_Spain,  color=Spain_color, order = kaggle_Spain[question_num].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x=question_num, data=kaggle_ROW,  color=ROW_color, order = kaggle_ROW[question_num].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

We can see that in Spain the largest group of kaggle users is academics and university staff, followed by people in computing, with the financial industry in third place. In the rest of the world, the computing sector takes first place and academics and university staff come second.
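
Since Spain accounts for only 454 of the 25,973 responses, the bar heights of the two panels are on very different scales, and the comparison is really one of rankings and shares. A minimal sketch of how the shares could be tabulated side by side with `value_counts(normalize=True)` is shown below. The answers and category labels are made up for illustration and are not survey data.

```python
import pandas as pd

# Hypothetical answers to a single-choice question for the two groups.
spain = pd.Series(['Academics/Education', 'Academics/Education',
                   'Computers/Technology', 'Accounting/Finance'])
rest = pd.Series(['Computers/Technology'] * 5
                 + ['Academics/Education'] * 3
                 + ['Accounting/Finance'] * 2)

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
shares = pd.DataFrame({
    'Spain (%)': spain.value_counts(normalize=True) * 100,
    'Rest of world (%)': rest.value_counts(normalize=True) * 100,
}).fillna(0).round(1)

print(shares.sort_values('Spain (%)', ascending=False))
```

In the notebook itself the two series would be `kaggle_Spain['Q20']` and `kaggle_ROW['Q20']`.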

# <center style="background-color:Gainsboro; width:60%;">Company size</center>

In [None]:
question_num = 'Q21'

plt.figure(figsize=(15, 5))
plt.suptitle('Tamaño de empresa', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x=question_num, data=kaggle_Spain,  color=Spain_color, order = order_Q21)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x=question_num, data=kaggle_ROW,  color=ROW_color, order = order_Q21)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

We can see that a large number of respondents work in small and medium-sized enterprises (SMEs).

# <center style="background-color:Gainsboro; width:80%;">Number of people responsible for Data Science</center>

In [None]:
question_num = 'Q22'

order = ['0',
         '1-2',
         '3-4',
         '5-9',
         '10-14',
        '15-19',
        '20+']

plt.figure(figsize=(15, 5))
plt.suptitle('Número de gente en DS', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x=question_num, data=kaggle_Spain,  color=Spain_color, order = order)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x=question_num, data=kaggle_ROW,  color=ROW_color, order = order)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

# <center style="background-color:Gainsboro; width:80%;">Does your company use ML models?</center>

In [None]:
question_num = 'Q23'
plt.figure(figsize=(15, 5))
plt.suptitle('Uso de modelos de ML en tu empresa', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x=question_num, data=kaggle_Spain,  color=Spain_color, order = kaggle_Spain[question_num].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x=question_num, data=kaggle_ROW,  color=ROW_color, order = kaggle_ROW[question_num].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

Here we see that Spain is ahead of the rest of the world in that many companies are starting to incorporate ML into their infrastructure.


# <center style="background-color:Gainsboro; width:60%;">Your activity at your company</center>

In [None]:
n_parts = 8
first_col = 119

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    # Split on ' -' (as in the other cells) so that hyphens inside a label do not truncate it
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Tu actividad en tu empresa', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

We can see that ML is applied mainly to decision-making.

# <center style="background-color:Gainsboro; width:60%;">Salary (USD)</center>

In [None]:
question_num = 'Q25'

order = [#'$0-999',
         '1,000-1,999','2,000-2,999','3,000-3,999','4,000-4,999','5,000-7,499',
         '7,500-9,999','10,000-14,999','15,000-19,999','20,000-24,999','25,000-29,999',
         '30,000-39,999','40,000-49,999','50,000-59,999','60,000-69,999','70,000-79,999',
         '80,000-89,999', '90,000-99,999','100,000-124,999','125,000-149,999','150,000-199,999',
         '200,000-249,999']

plt.figure(figsize=(15, 5))
plt.suptitle('Salario (USD)', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x=question_num, data=kaggle_Spain,  color=Spain_color, order = order)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
# Use kaggle_ROW here: kaggle_survey_2021 also contains the responses from Spain
sns.countplot(x=question_num, data=kaggle_ROW,  color=ROW_color, order = order)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

In Spain there is a peak in the \$40-50k per year range.
(Note: I have omitted incomes below \$1k for clarity. The plot covers the brackets from \$1,000 to \$249,999.)
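
The salary brackets are text labels, so their plotting order has to be given explicitly. As an alternative to typing the list by hand, the order could be derived by sorting the labels on their lower limit. The following is a minimal sketch under that assumption. The function name `lower_bound` is hypothetical, and the sample labels follow the format used in the `order` list above.

```python
def lower_bound(bracket):
    """Return the lower limit of a bracket label such as '1,000-1,999' as an int."""
    first = bracket.split('-')[0]
    # Keep digits only, dropping '$', ',' and any other symbols.
    digits = ''.join(ch for ch in first if ch.isdigit())
    return int(digits)

# A few labels in arbitrary order.
brackets = ['10,000-14,999', '1,000-1,999', '100,000-124,999', '$0-999', '5,000-7,499']

ordered = sorted(brackets, key=lower_bound)
# Drop the lowest bracket, as the plot above does.
ordered_from_1k = [b for b in ordered if lower_bound(b) >= 1000]

print(ordered)
print(ordered_from_1k)
```

A plain alphabetical sort would place '10,000-14,999' before '5,000-7,499', which is why the numeric key is needed.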

# <center style="background-color:Gainsboro; width:80%;">Money spent in the last 5 years (USD)</center>
On ML or cloud computing services

In [None]:
question_num = 'Q26'

plt.figure(figsize=(15, 5))
plt.suptitle('Dinero gastado en servicios', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x=question_num, data=kaggle_Spain,  color=Spain_color, order = order_26)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
# Use kaggle_ROW here: kaggle_survey_2021 also contains the responses from Spain
sns.countplot(x=question_num, data=kaggle_ROW,  color=ROW_color, order = order_26)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

# <center style="background-color:Gainsboro; width:60%;">'Cloud' services</center>

In [None]:
first_col = 129
n_parts = 12

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle("Servicios 'Cloud'", fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

The three most popular cloud services are
* [Amazon AWS](https://aws.amazon.com/)
* [Microsoft Azure](https://azure.microsoft.com/)
* [Google Cloud Platform](https://cloud.google.com/)

# <center style="background-color:Gainsboro; width:60%;">'Cloud' storage</center>

In [None]:
first_col = 147
n_parts = 8

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle("Almacenamiento 'Cloud'", fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

The two most popular cloud storage services are
* [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/)
* [Google Cloud Storage (GCS)](https://cloud.google.com/storage/)

# <center style="background-color:Gainsboro; width:60%;">'Cloud' ML services</center>

In [None]:
first_col = 155
n_parts = 10

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Cloud ML', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

# <center style="background-color:Gainsboro; width:60%;">'Big Data' services</center>

In [None]:
first_col = 165
n_parts = 21

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Big data services', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

When it comes to Big Data, [MySQL](https://www.mysql.com/) is the most popular.

# <center style="background-color:Gainsboro; width:60%;">Business Intelligence (BI)</center>

In [None]:
first_col = 187
n_parts = 17

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Business Intelligence', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

Among people working in Business Intelligence, the most popular tools are
* [Microsoft Power BI](https://powerbi.microsoft.com/)
* [Tableau](https://www.tableau.com/)

# <center style="background-color:Gainsboro; width:60%;">AutoML: Tools</center>

In [None]:
first_col = 205
n_parts = 8

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('AutoML', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

* Automated data augmentation (e.g. [imgaug](https://imgaug.readthedocs.io/en/latest/), [albumentations](https://albumentations.ai/))
* Automated feature engineering/selection (e.g. [tpot](https://epistasislab.github.io/tpot/), [boruta_py](https://github.com/scikit-learn-contrib/boruta_py))
* Automated model selection (e.g. [auto-sklearn](https://automl.github.io/auto-sklearn/master/), [xcessiv](https://github.com/reiinakano/xcessiv))
* Automated model architecture searches (e.g. [darts](https://github.com/quark0/darts), [enas](https://github.com/melodyguan/enas))
* Automated hyperparameter tuning (e.g. [hyperopt](https://hyperopt.github.io/hyperopt/), [ray.tune](https://www.ray.io/ray-tune), [Vizier](https://cloud.google.com/ai-platform/optimizer/docs/overview))
* Automation of full ML pipelines: see the next chart

We can see that, at least among kaggle users, AutoML is little used.

# <center style="background-color:Gainsboro; width:60%;">Full AutoML</center>

In [None]:
first_col = 213
n_parts = 8

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Full AutoML', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

* [Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/)
* [Databricks AutoML](https://databricks.com/product/automl)
* [H2O Driverless AI](https://www.h2o.ai/products/h2o-driverless-ai/)
* [Google Cloud AutoML](https://cloud.google.com/automl/)
* [Microsoft Azure Automated machine learning](https://azure.microsoft.com/en-us/services/machine-learning/automatedml/)
* [DataRobot Automated Machine Learning](https://www.datarobot.com/platform/automated-machine-learning/)

# <center style="background-color:Gainsboro; width:60%;">ML experiment logging</center>

In [None]:
first_col = 221
n_parts = 12

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Logging de experimentos', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

Most people do not use a formal system for logging their experiments, but among those who do, the most popular are:

* [TensorBoard](https://www.tensorflow.org/tensorboard/)
* [MLflow](https://mlflow.org/)

# <center style="background-color:Gainsboro; width:60%;">Data Science courses</center>

In [None]:
first_col = 243
n_parts = 12

ROW_list   =[]
Spain_list =[]
Q7_list    =[]

for n in range(first_col,first_col+n_parts):
    ROW_list.append(kaggle_ROW.iloc[:,n].count() )
    Spain_list.append(kaggle_Spain.iloc[:,n].count() )
    Q7_list.append(kaggle_survey_2021.iloc[0,n].split(' -')[-1])
    
Spain_dictionary = dict(zip(Q7_list, Spain_list))
ROW_dictionary = dict(zip(Q7_list, ROW_list))

Spain_dictionary_sorted = sorted(Spain_dictionary.items(), key=operator.itemgetter(1), reverse = True)
ROW_dictionary_sorted   = sorted(ROW_dictionary.items(), key=operator.itemgetter(1), reverse = True)

Spain_df = pd.DataFrame(Spain_dictionary_sorted, columns=['Language', 'count'])
ROW_df = pd.DataFrame(ROW_dictionary_sorted, columns=['Language', 'count'])

plt.figure(figsize=(15, 5))
plt.suptitle('Cursos de Data Science', fontsize=20)
plt.subplot(1, 2, 1)
sns.barplot(x="Language", y="count", data=Spain_df, color=Spain_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
sns.barplot(x="Language", y="count", data=ROW_df, color=ROW_color)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

The three most popular are:
* [Coursera](https://www.coursera.org/)
* [Kaggle learn courses](https://www.kaggle.com/learn)
* [Udemy](https://www.udemy.com/)

# <center style="background-color:Gainsboro; width:60%;">Analysis software</center>

In [None]:
question_num = 'Q41'
plt.figure(figsize=(15, 5))
plt.suptitle('Programas de análisis', fontsize=22)
plt.subplot(1, 2, 1)
plt.title('España', fontsize=14)
sns.countplot(x=question_num, data=kaggle_Spain,  color=Spain_color, order = kaggle_Spain[question_num].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.subplot(1, 2, 2)
plt.title('El resto del mundo', fontsize=14)
sns.countplot(x=question_num, data=kaggle_ROW,  color=ROW_color, order = kaggle_ROW[question_num].value_counts().index)
plt.tick_params(axis='x', rotation=90)
plt.xlabel("")
plt.show()

* Basic statistical software, e.g. [Microsoft Excel](https://www.microsoft.com/en-us/microsoft-365/excel), [Google Sheets](https://www.google.com/sheets/about/), *etc.*
* Advanced statistical software, e.g. [SPSS](https://www.ibm.com/analytics/spss-statistics-software), [SAS](https://www.sas.com/), *etc.*
* Business intelligence software, e.g. [Salesforce](https://www.salesforce.com), [Tableau](https://www.tableau.com/), [Spotfire](https://www.tibco.com/products/tibco-spotfire), *etc.*
* Local development environments, e.g. [RStudio](https://www.rstudio.com/products/rstudio/), [JupyterLab](https://jupyter.org/), *etc.*
* Cloud-based data software & APIs, e.g. [AWS](https://aws.amazon.com/), [GCP](https://console.cloud.google.com/getting-started), [Azure](https://azure.microsoft.com/), *etc.*