# Analysis of the ML field by country development level

Machine learning is one of the fastest-emerging areas of computing, and many of its techniques demand substantial computational power. We therefore want to study how differences in country development, with countries split into developed and emerging, affect the field, including its job market.

In [462]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import fpgrowth
import pysubgroup as ps
import numpy as np

np.seterr(divide='ignore', invalid='ignore')

df = pd.read_csv('./data/processed.csv')
df.head()

Unnamed: 0,Duration (in seconds),Q2,Q3,Q4,Q5,Q6_Coursera,Q6_edX,Q6_Kaggle Learn Courses,Q6_DataCamp,Q6_Fast.ai,...,Q44_Twitter (data science influencers),"Q44_Email newsletters (Data Elixir, O'Reilly Data & AI, etc)","Q44_Reddit (r/machinelearning, etc)","Q44_Kaggle (notebooks, forums, etc)","Q44_Course Forums (forums.fast.ai, Coursera forums, etc)","Q44_YouTube (Kaggle YouTube, Cloud AI Adventures, etc)","Q44_Podcasts (Chai Time Data Science, O’Reilly Data Show, etc)","Q44_Blogs (Towards Data Science, Analytics Vidhya, etc)","Q44_Journal Publications (peer-reviewed journals, conference proceedings, etc)","Q44_Slack Communities (ods.ai, kagglenoobs, etc)"
0,121,30-34,Man,India,No,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,462,30-34,Man,Algeria,No,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,293,18-21,Man,Egypt,Yes,1,1,0,1,0,...,1,1,0,1,0,1,1,0,0,0
3,851,55-59,Man,France,No,1,0,1,0,0,...,1,0,0,1,1,0,0,1,0,0
4,232,45-49,Man,India,Yes,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


## Splitting countries into developed and emerging/developing

The split follows the IMF country grouping: https://www.imf.org/en/Publications/WEO/weo-database/2023/April/groups-and-aggregates

In [463]:
advanced_economies = {
    "Australia", "Belgium", "Canada", "Czech Republic", "France", "Germany",
    "Hong Kong (S.A.R.)", "Ireland", "Israel", "Italy", "Japan", "Netherlands",
    "Portugal", "Singapore", "South Korea", "Spain", "Taiwan",
    "United Kingdom of Great Britain and Northern Ireland",
    "United States of America"
}

emerging_and_developing_economies = {
    "Algeria", "Argentina", "Bangladesh", "Brazil", "Cameroon", "Chile",
    "China", "Colombia", "Ecuador", "Egypt", "Ethiopia", "Ghana", "India",
    "Indonesia", "Iran, Islamic Republic of", "Kenya", "Malaysia", "Mexico",
    "Morocco", "Nepal", "Nigeria", "Pakistan", "Peru", "Philippines",
    "Poland", "Romania", "Russia", "Saudi Arabia", "South Africa",
    "Sri Lanka", "Thailand", "Tunisia", "Turkey", "Ukraine",
    "United Arab Emirates", "Viet Nam", "Zimbabwe"
}

def set_tier(name):
    # 2 = advanced economy, 1 = emerging/developing economy, 0 = not in either set
    if name in advanced_economies:
        return 2
    elif name in emerging_and_developing_economies:
        return 1
    else:
        return 0

df['Country-status'] = df['Q4'].apply(set_tier)
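
Any `Q4` value that appears in neither set silently ends up in tier 0, so it can be worth listing those values once. The sketch below uses a made-up toy frame and shortened stand-in sets (`advanced`, `emerging`, and the `toy` data are assumptions for illustration, not the survey data).

```python
import pandas as pd

# Shortened, hypothetical stand-ins for the two sets defined in the notebook.
advanced = {"France", "Japan"}
emerging = {"India", "Brazil"}

def set_tier(name):
    if name in advanced:
        return 2
    if name in emerging:
        return 1
    return 0

# Toy data: "Other" and the misspelled "Brasil" both fall through to tier 0.
toy = pd.DataFrame({"Q4": ["India", "France", "Other", "Brasil", "Japan", "India"]})
toy["Country-status"] = toy["Q4"].apply(set_tier)

# Number of respondents per tier.
tier_counts = toy["Country-status"].value_counts().to_dict()

# Country labels that matched neither set.
unmapped = sorted(toy.loc[toy["Country-status"] == 0, "Q4"].unique())

print(tier_counts)
print(unmapped)
```

Running the same two lines on `df` would show whether tier 0 only holds genuinely unclassified answers or also spelling variants of listed countries.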

## Filtering the DataFrame for the initial subgroup discovery

Because the questions asked [depend on the initial answers](./data/kaggle_survey_2022_answer_choices.pdf), subgroup discovery is first run on the "general" questions, plus non-general questions that most respondents still answer (for example, how many years they have been programming).

Columns that accept multiple answers are not used, since they generally get in the way of the discovery. They can be analyzed later via frequent-pattern mining.
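
As a rough idea of what a frequent-pattern pass over the multiple-answer columns could look like, the sketch below counts how often pairs of one-hot columns are selected together. It is a brute-force pairwise support count in plain pandas, not FP-Growth; the toy values and the `min_support` threshold are made up.

```python
import pandas as pd

# Toy one-hot answers; column names mimic the Q6_* style, values are invented.
toy = pd.DataFrame({
    "Q6_Coursera": [1, 1, 0, 1],
    "Q6_edX": [1, 0, 0, 1],
    "Q6_Kaggle Learn Courses": [0, 1, 1, 1],
})

n = len(toy)
# Entry (a, b) is the number of rows where both a and b are 1.
co_counts = toy.T.dot(toy)
support = co_counts / n

min_support = 0.5  # arbitrary threshold for the example
cols = list(toy.columns)
frequent_pairs = [
    (a, b, float(support.loc[a, b]))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if support.loc[a, b] >= min_support
]

for pair in frequent_pairs:
    print(pair)
```

This only covers itemsets of size two; larger itemsets are where a dedicated algorithm such as the imported `fpgrowth` becomes worthwhile.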

In [464]:
def filter_df(df, questions):
    cols = [*df]
    selected = []
    for q in questions:
        # Keep the column named q itself, plus any multiple-answer columns named q_<option>
        selected.extend([c for c in cols if c == q or c.startswith(q + '_')])
    return df[selected]

filtered = filter_df(df, ['Country-status', 'Q3', 'Q5', 'Q8', 'Q11', 'Q16'])
filtered.head()

Unnamed: 0,Country-status,Q3,Q5,Q8,Q11,Q16
0,1,Man,No,,,
1,1,Man,No,Master’s degree,1-3 years,Under 1 year
2,1,Man,Yes,Bachelor’s degree,1-3 years,1-2 years
3,2,Man,No,Some college/university study without earning ...,10-20 years,1-2 years
4,1,Man,Yes,Bachelor’s degree,5-10 years,I do not use machine learning methods
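
The `c == q or c.startswith(q + '_')` test in `filter_df` matters because question ids share prefixes. A small sketch on a list of column names (the helper name `select_columns` and the column `Q60_x` are made up for illustration) compares it with a plain prefix match:

```python
def select_columns(cols, questions):
    """Same matching rule as filter_df, applied to a list of column names."""
    selected = []
    for q in questions:
        selected.extend(c for c in cols if c == q or c.startswith(q + "_"))
    return selected

# "Q60_x" is a hypothetical column, added only to show the prefix clash.
cols = ["Q1", "Q11", "Q16", "Q6_Coursera", "Q6_edX", "Q60_x"]

# Exact name or "<question>_" prefix: only the requested questions are kept.
strict = select_columns(cols, ["Q1", "Q6"])

# Plain prefix match: "Q1" also drags in Q11 and Q16, "Q6" drags in Q60_x.
naive = [c for c in cols if c.startswith("Q1") or c.startswith("Q6")]

print(strict)
print(naive)
```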


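### Developed countries

Respondents from developed countries tend to be non-students (`Q5=='No'`), to have far more programming experience (`Q11=='20+ years'`), and to be men, although adding `Q3=='Man'` to `Q5=='No'` barely changes the lift (1.402 vs. 1.400).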

In [465]:
target = ps.BinaryTarget('Country-status', 2)
searchspace = ps.create_selectors(filtered, ignore=['Country-status'])
task = ps.SubgroupDiscoveryTask (
    filtered,
    target,
    searchspace,
    result_set_size=5,
    depth=4,
    qf=ps.WRAccQF())
result = ps.BeamSearch().execute(task)

for x in result.to_dataframe()['subgroup']:
    print(x)
result.to_dataframe()

Q5=='No'
Q3=='Man' AND Q5=='No'
Q11=='20+ years'
Q5=='No' AND Q8=='Master’s degree'
Q11=='20+ years' AND Q5=='No'


Unnamed: 0,quality,subgroup,size_sg,size_dataset,positives_sg,positives_dataset,size_complement,relative_size_sg,relative_size_complement,coverage_sg,coverage_complement,target_share_sg,target_share_complement,target_share_dataset,lift
0,0.050695,Q5=='No',12036,23997,4260,6068,11961,0.501563,0.498437,0.702044,0.297956,0.353938,0.151158,0.252865,1.399712
1,0.040044,Q3=='Man' AND Q5=='No',9444,23997,3349,6068,14553,0.393549,0.606451,0.551912,0.448088,0.354617,0.186834,0.252865,1.402396
2,0.023559,Q11=='20+ years',1537,23997,954,6068,22460,0.06405,0.93595,0.157218,0.842782,0.62069,0.227694,0.252865,2.454629
3,0.022681,Q5=='No' AND Q8=='Master’s degree',4962,23997,1799,6068,19035,0.206776,0.793224,0.296473,0.703527,0.362555,0.224271,0.252865,1.433791
4,0.02176,Q11=='20+ years' AND Q5=='No',1332,23997,859,6068,22665,0.055507,0.944493,0.141562,0.858438,0.644895,0.229826,0.252865,2.550353
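
The `quality` and `lift` columns can be recomputed from the counts in the same row. The sketch below assumes the usual definition of weighted relative accuracy (relative subgroup size times the difference between the subgroup's and the dataset's target share), which is consistent with the numbers in row 0; the helper names `wracc` and `lift` are made up for this example.

```python
def wracc(size_sg, size_dataset, positives_sg, positives_dataset):
    """Weighted relative accuracy: relative size * (subgroup share - dataset share)."""
    relative_size = size_sg / size_dataset
    share_sg = positives_sg / size_sg
    share_dataset = positives_dataset / size_dataset
    return relative_size * (share_sg - share_dataset)

def lift(size_sg, size_dataset, positives_sg, positives_dataset):
    """Ratio between the target share in the subgroup and in the whole dataset."""
    return (positives_sg / size_sg) / (positives_dataset / size_dataset)

# Counts taken from row 0 of the table (subgroup Q5=='No').
quality = wracc(12036, 23997, 4260, 6068)
ratio = lift(12036, 23997, 4260, 6068)

print(round(quality, 6), round(ratio, 6))
```

Because the relative size multiplies the share difference, a large subgroup with a modest lift (`Q5=='No'`) can rank above a small subgroup with a much higher lift (`Q11=='20+ years'`).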


### Emerging/developing countries

Respondents from emerging/developing countries are more often students (`Q5=='Yes'`), again mostly men. Respondents with less than one year of programming experience and those at the bachelor's level are also over-represented.

In [466]:
target = ps.BinaryTarget('Country-status', 1)
searchspace = ps.create_selectors(filtered, ignore=['Country-status'])
task = ps.SubgroupDiscoveryTask (
    filtered,
    target,
    searchspace,
    result_set_size=5,
    depth=4,
    qf=ps.WRAccQF())
result = ps.BeamSearch().execute(task)

for x in result.to_dataframe()['subgroup']:
    print(x)
result.to_dataframe()

Q5=='Yes'
Q3=='Man' AND Q5=='Yes'
Q5=='Yes' AND Q8=='Bachelor’s degree'
Q8=='Bachelor’s degree'
Q11=='< 1 years'


Unnamed: 0,quality,subgroup,size_sg,size_dataset,positives_sg,positives_dataset,size_complement,relative_size_sg,relative_size_complement,coverage_sg,coverage_complement,target_share_sg,target_share_complement,target_share_dataset,lift
0,0.050216,Q5=='Yes',11961,23997,9348,16337,12036,0.498437,0.501563,0.572198,0.427802,0.78154,0.580675,0.680793,1.147984
1,0.039048,Q3=='Man' AND Q5=='Yes',8822,23997,6943,16337,15175,0.367629,0.632371,0.424986,0.575014,0.78701,0.619044,0.680793,1.156018
2,0.034146,Q5=='Yes' AND Q8=='Bachelor’s degree',4550,23997,3917,16337,19447,0.189607,0.810393,0.239763,0.760237,0.860879,0.638659,0.680793,1.264523
3,0.032627,Q8=='Bachelor’s degree',7625,23997,5974,16337,16372,0.317748,0.682252,0.365673,0.634327,0.783475,0.632971,0.680793,1.150827
4,0.027168,Q11=='< 1 years',5454,23997,4365,16337,18543,0.227278,0.772722,0.267185,0.732815,0.80033,0.645634,0.680793,1.175584


## Subgroup discovery in the ML job market

We want to find subgroups among employed respondents (non-students whose `Q23` is not "Currently not employed"), so that the industry-related questions can be used.

In [503]:
employed = df[(df['Q5'] == 'No') & (df['Q23'] != 'Currently not employed')]
employed = filter_df(employed, ['Country-status', 'Q8', 'Q16', 'Q23', 'Q24', 'Q25', 'Q26', 'Q27', 'Q29', 'Q30']).dropna()
employed.head()

Unnamed: 0,Country-status,Q8,Q16,Q23,Q24,Q25,Q26,Q27,Q29,Q30
3,2,Some college/university study without earning ...,1-2 years,Data Scientist,Online Service/Internet-based Services,0-49 employees,1-2,"We recently started using ML methods (i.e., mo...","25,000-29,999","$1000-$9,999"
7,2,Bachelor’s degree,4-5 years,Software Engineer,Insurance/Risk Assessment,250-999 employees,20+,"We have well established ML methods (i.e., mod...","100,000-124,999",$0 ($USD)
8,2,Doctoral degree,5-10 years,Research Scientist,Government/Public Service,"1000-9,999 employees",20+,"We recently started using ML methods (i.e., mo...","100,000-124,999",$100-$999
13,2,Doctoral degree,5-10 years,Developer Advocate,Computers/Technology,"1000-9,999 employees",20+,"We have well established ML methods (i.e., mod...","200,000-249,999",$100-$999
16,2,Master’s degree,5-10 years,Data Scientist,Computers/Technology,"1000-9,999 employees",3-4,"We have well established ML methods (i.e., mod...","200,000-249,999","$100,000 or more ($USD)"
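
One detail of the filter above: in pandas, comparing a missing value with `!=` evaluates to `True`, so rows with no `Q23` answer pass the mask and are only removed later by `.dropna()`, together with every row missing any of the selected questions. A toy illustration (the data is made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Q5": ["No", "No", "No", "Yes"],
    "Q23": ["Data Scientist", "Currently not employed", np.nan, np.nan],
    "Q29": ["25,000-29,999", np.nan, np.nan, np.nan],
})

# Row 2 has no Q23 answer, yet NaN != "Currently not employed" is True.
mask = (toy["Q5"] == "No") & (toy["Q23"] != "Currently not employed")
after_mask = toy[mask]

# dropna() then removes every row with at least one missing answer.
after_dropna = after_mask.dropna()

print(len(after_mask), len(after_dropna))
```

Comparing `len(employed)` before and after `.dropna()` on the real data would show how many respondents are lost at that step.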


### Developed countries

Employed respondents from developed countries are over-represented at employers with well-established ML methods, in the high salary brackets (100,000 and above), and among doctoral degree holders.

In [506]:
target = ps.BinaryTarget('Country-status', 2)
searchspace = ps.create_selectors(employed, ignore=['Country-status', 'Q4'])
task = ps.SubgroupDiscoveryTask (
    employed,
    target,
    searchspace,
    result_set_size=10,
    depth=4,
    qf=ps.WRAccQF())
result = ps.BeamSearch().execute(task)

for x in result.to_dataframe()['subgroup']:
    print(x)
result.to_dataframe()

Q27=='We have well established ML methods (i.e., models in production for more than 2 years)'
Q29=='150,000-199,999'
Q16=='5-10 years'
Q29=='100,000-124,999'
Q8=='Doctoral degree'
Q26=='20+'
Q26=='20+' AND Q27=='We have well established ML methods (i.e., models in production for more than 2 years)'
Q29=='125,000-149,999'
Q27=='We have well established ML methods (i.e., models in production for more than 2 years)' AND Q8=='Doctoral degree'
Q16=='5-10 years' AND Q27=='We have well established ML methods (i.e., models in production for more than 2 years)'


Unnamed: 0,quality,subgroup,size_sg,size_dataset,positives_sg,positives_dataset,size_complement,relative_size_sg,relative_size_complement,coverage_sg,coverage_complement,target_share_sg,target_share_complement,target_share_dataset,lift
0,0.026286,Q27=='We have well established ML methods (i.e...,1597,7409,822,2910,5812,0.215549,0.784451,0.282474,0.717526,0.514715,0.359257,0.392766,1.310489
1,0.023055,"Q29=='150,000-199,999'",334,7409,302,2910,7075,0.04508,0.95492,0.10378,0.89622,0.904192,0.368622,0.392766,2.302115
2,0.022001,Q16=='5-10 years',802,7409,478,2910,6607,0.108247,0.891753,0.164261,0.835739,0.59601,0.368094,0.392766,1.51747
3,0.021841,"Q29=='100,000-124,999'",390,7409,315,2910,7019,0.052639,0.947361,0.108247,0.891753,0.807692,0.369711,0.392766,2.056423
4,0.021014,Q8=='Doctoral degree',1284,7409,660,2910,6125,0.173303,0.826697,0.226804,0.773196,0.514019,0.367347,0.392766,1.308716
5,0.020404,Q26=='20+',1904,7409,899,2910,5505,0.256985,0.743015,0.308935,0.691065,0.472164,0.365304,0.392766,1.202152
6,0.017306,Q26=='20+' AND Q27=='We have well established ...,860,7409,466,2910,6549,0.116075,0.883925,0.160137,0.839863,0.54186,0.373187,0.392766,1.379603
7,0.017043,"Q29=='125,000-149,999'",259,7409,228,2910,7150,0.034957,0.965043,0.078351,0.921649,0.880309,0.375105,0.392766,2.241309
8,0.015294,Q27=='We have well established ML methods (i.e...,320,7409,239,2910,7089,0.043191,0.956809,0.082131,0.917869,0.746875,0.376781,0.392766,1.90158
9,0.01394,Q16=='5-10 years' AND Q27=='We have well estab...,371,7409,249,2910,7038,0.050074,0.949926,0.085567,0.914433,0.671159,0.37809,0.392766,1.708803
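
Since WRAcc rewards subgroup size, the ranking above mixes large subgroups with modest lift and small ones with high lift. A complementary view is to sort the same rows by lift while enforcing a minimum size. The sketch below copies `size_sg` and `lift` for seven rows of the table; the `min_size` threshold is an arbitrary choice.

```python
# (subgroup, size_sg, lift) copied from the result table above.
rows = [
    ("Q27=='We have well established ML methods (i.e., models in production for more than 2 years)'", 1597, 1.310489),
    ("Q29=='150,000-199,999'", 334, 2.302115),
    ("Q16=='5-10 years'", 802, 1.51747),
    ("Q29=='100,000-124,999'", 390, 2.056423),
    ("Q8=='Doctoral degree'", 1284, 1.308716),
    ("Q26=='20+'", 1904, 1.202152),
    ("Q29=='125,000-149,999'", 259, 2.241309),
]

min_size = 300  # arbitrary cut-off to drop very small subgroups

# Keep subgroups that are large enough, then order them by lift, highest first.
by_lift = sorted(
    (r for r in rows if r[1] >= min_size),
    key=lambda r: r[2],
    reverse=True,
)

for name, size, lift_value in by_lift:
    print(f"{lift_value:.2f}  n={size:<5} {name}")
```

Under this ordering the salary brackets move to the top, while the well-established-ML subgroup, first by WRAcc, drops to the middle.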


### Emerging/developing countries

Employed respondents from emerging/developing countries are over-represented among those very new to ML (under one year of use) and at employers that do not use ML methods. Bachelor's degree holders also stand out, as do the lowest salary brackets and low spending on cloud services.

In [508]:
target = ps.BinaryTarget('Country-status', 1)
searchspace = ps.create_selectors(employed, ignore=['Country-status', 'Q4'])
task = ps.SubgroupDiscoveryTask (
    employed,
    target,
    searchspace,
    result_set_size=10,
    depth=4,
    qf=ps.WRAccQF())
result = ps.BeamSearch().execute(task)

for x in result.to_dataframe()['subgroup']:
    print(x)
result.to_dataframe()

Q16=='Under 1 year'
Q29=='$0-999'
Q8=='Bachelor’s degree'
Q29=='10,000-14,999'
Q27=='No (we do not use ML methods)'
Q29=='1,000-1,999'
Q30=='$1-$99'
Q16=='Under 1 year' AND Q8=='Master’s degree'
Q29=='$0-999' AND Q30=='$0 ($USD)'
Q24=='Computers/Technology'


Unnamed: 0,quality,subgroup,size_sg,size_dataset,positives_sg,positives_dataset,size_complement,relative_size_sg,relative_size_complement,coverage_sg,coverage_complement,target_share_sg,target_share_complement,target_share_dataset,lift
0,0.029428,Q16=='Under 1 year',1707,7409,1138,3993,5702,0.230395,0.769605,0.284999,0.715001,0.666667,0.500702,0.538939,1.236998
1,0.02897,Q29=='$0-999',921,7409,711,3993,6488,0.124308,0.875692,0.178062,0.821938,0.771987,0.505857,0.538939,1.43242
2,0.022779,Q8=='Bachelor’s degree',1765,7409,1120,3993,5644,0.238224,0.761776,0.280491,0.719509,0.634561,0.509036,0.538939,1.177426
3,0.018877,"Q29=='10,000-14,999'",453,7409,384,3993,6956,0.061142,0.938858,0.096168,0.903832,0.847682,0.518833,0.538939,1.572872
4,0.01728,Q27=='No (we do not use ML methods)',1553,7409,965,3993,5856,0.20961,0.79039,0.241673,0.758327,0.621378,0.517077,0.538939,1.152965
5,0.016535,"Q29=='1,000-1,999'",385,7409,330,3993,7024,0.051964,0.948036,0.082645,0.917355,0.857143,0.521498,0.538939,1.590426
6,0.01653,Q30=='$1-$99',1272,7409,808,3993,6137,0.171683,0.828317,0.202354,0.797646,0.63522,0.518983,0.538939,1.178649
7,0.015027,Q16=='Under 1 year' AND Q8=='Master’s degree',749,7409,515,3993,6660,0.101093,0.898907,0.128976,0.871024,0.687583,0.522222,0.538939,1.275809
8,0.014456,Q29=='$0-999' AND Q30=='$0 ($USD)',421,7409,334,3993,6988,0.056823,0.943177,0.083646,0.916354,0.793349,0.523612,0.538939,1.472057
9,0.013974,Q24=='Computers/Technology',1951,7409,1155,3993,5458,0.263328,0.736672,0.289256,0.710744,0.592004,0.519971,0.538939,1.098462
