# Make an income prediction

## Mission 1
Summarize the data used:
- Year(s) covered by the data
- Number of countries included
- Population covered by the analysis (as a percentage of the world population)

For each country, the World Income Distribution data give the quantiles of the income distribution of its population.
- What type of quantiles are these (quartiles, deciles, etc.)?
- In your view, is sampling a population by quantiles a good method? Why?

## Mission 2

Show the diversity of countries in terms of income distribution with a chart.
- It should show the mean income (y-axis, logarithmic scale) of each income class (x-axis) for 5 to 10 countries chosen to illustrate the diversity of cases.
- Plot the Lorenz curve of each of the chosen countries.
- For each of these countries, plot the evolution of the Gini index over the years.
- Rank the countries by Gini index. Give the mean, the 5 countries with the highest Gini index and the 5 countries with the lowest Gini index. Where does France rank?
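
The Lorenz curve and Gini index requested above can be sketched as follows. This is a minimal illustration, assuming each country is described by equally sized income classes (such as the 100 quantiles in this dataset); the function names `lorenz_curve` and `gini_index` and the toy income lists are hypothetical and not taken from the notebook.

```python
import numpy as np

def lorenz_curve(incomes):
    """Cumulative income shares, starting at 0 and ending at 1."""
    sorted_incomes = np.sort(np.asarray(incomes, dtype=float))
    cumulative = np.cumsum(sorted_incomes) / sorted_incomes.sum()
    return np.concatenate(([0.0], cumulative))

def gini_index(incomes):
    """Gini index computed as 1 - 2 * (area under the Lorenz curve)."""
    lorenz = lorenz_curve(incomes)
    n_classes = len(lorenz) - 1
    # Trapezoid rule over n_classes intervals of equal width 1 / n_classes
    area = (lorenz[:-1] + lorenz[1:]).sum() / (2 * n_classes)
    return 1 - 2 * area

# Toy distributions (made up for illustration)
equal = [100.0] * 100            # everyone earns the same
unequal = [0.0] * 99 + [100.0]   # one class holds all the income

print(gini_index(equal))
print(gini_index(unequal))
```

With only 100 classes, the index of the fully concentrated case stays slightly below 1, because the trapezoid rule treats income as spread evenly inside the top class.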

## Mission 3
At this stage, two of the three desired explanatory variables are available:
- the mean income of the country
- the Gini index of the country

What is still missing, for a given individual, is the income class of their parents.

...

## Importing the data and libraries

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Import the data
data = pd.read_csv('/Users/teilo/Desktop/OC - Projet 7/predictions_revenus/Données/data-projet7.csv')
data

Unnamed: 0,country,year_survey,quantile,nb_quantiles,income,gdpppp
0,ALB,2008,1,100,72889795,7297
1,ALB,2008,2,100,91666235,7297
2,ALB,2008,3,100,1010916,7297
3,ALB,2008,4,100,10869078,7297
4,ALB,2008,5,100,11326997,7297
...,...,...,...,...,...,...
11594,COD,2008,96,100,8106233,30319305
11595,COD,2008,97,100,9117834,30319305
11596,COD,2008,98,100,10578074,30319305
11597,COD,2008,99,100,12866029,30319305


## Data cleaning

In [3]:
data = data.rename(columns= {
    'country': 'Country',
    'year_survey': 'Year',
    'quantile': 'Quantile',
    'income': 'Income',
    'gdpppp': 'GDP'
})
data

Unnamed: 0,Country,Year,Quantile,nb_quantiles,Income,GDP
0,ALB,2008,1,100,72889795,7297
1,ALB,2008,2,100,91666235,7297
2,ALB,2008,3,100,1010916,7297
3,ALB,2008,4,100,10869078,7297
4,ALB,2008,5,100,11326997,7297
...,...,...,...,...,...,...
11594,COD,2008,96,100,8106233,30319305
11595,COD,2008,97,100,9117834,30319305
11596,COD,2008,98,100,10578074,30319305
11597,COD,2008,99,100,12866029,30319305


In [4]:
data.describe()

Unnamed: 0,Year,Quantile,nb_quantiles
count,11599.0,11599.0,11599.0
mean,2007.982757,50.500819,100.0
std,0.909633,28.868424,0.0
min,2004.0,1.0,100.0
25%,2008.0,25.5,100.0
50%,2008.0,51.0,100.0
75%,2008.0,75.5,100.0
max,2011.0,100.0,100.0


The survey years in this dataset range from 2004 to 2011, with a mean year of about 2008 (2007.98).

The 'nb_quantiles' column can be dropped because its value is always 100 (its standard deviation is 0).
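
The `describe()` output above lists only Year, Quantile and nb_quantiles, which suggests that Income and GDP were read as text rather than numbers. Outputs later in this notebook print incomes with a decimal comma (for example `437,8937`), so a conversion step may be needed before any computation. The sketch below is an assumption about how this could be done, not a step taken from the notebook; the `sample` frame is a toy stand-in for `data`.

```python
import pandas as pd

# Toy values copied from incomes printed later in this notebook
sample = pd.DataFrame({"Income": ["437,8937", "508,17133", "668"]})

# Replace the decimal comma with a point, then convert to float
sample["Income"] = pd.to_numeric(
    sample["Income"].str.replace(",", ".", regex=False)
)

print(sample["Income"].dtype)
```

Passing `decimal=','` to `pd.read_csv` is an alternative that performs the same conversion at load time.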

In [5]:
data = data.drop(columns='nb_quantiles')
data

Unnamed: 0,Country,Year,Quantile,Income,GDP
0,ALB,2008,1,72889795,7297
1,ALB,2008,2,91666235,7297
2,ALB,2008,3,1010916,7297
3,ALB,2008,4,10869078,7297
4,ALB,2008,5,11326997,7297
...,...,...,...,...,...
11594,COD,2008,96,8106233,30319305
11595,COD,2008,97,9117834,30319305
11596,COD,2008,98,10578074,30319305
11597,COD,2008,99,12866029,30319305


In [6]:
data3 = data.copy()

In [7]:
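# numeric_only=True keeps the text columns (Income, GDP) out of the mean on recent pandas versions
mean = data.groupby('Country').mean(numeric_only=True)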
mean

Unnamed: 0_level_0,Year,Quantile
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
ALB,2008.0,50.5
ARG,2008.0,50.5
ARM,2008.0,50.5
AUT,2008.0,50.5
AZE,2008.0,50.5
...,...,...
VEN,2006.0,50.5
VNM,2006.0,50.5
XKX,2008.0,50.5
YEM,2008.0,50.5


This dataset covers 116 countries.
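
The table has 11599 rows, while 116 countries with 100 quantiles each would give 11600, so one country appears to have a quantile missing. The sketch below shows one way to count countries and spot incomplete ones; the `toy` frame and the expected class count of 3 are made up for illustration (the real data would use 100).

```python
import pandas as pd

# Toy stand-in for the income table: AAA is complete, BBB misses one class
toy = pd.DataFrame({
    "Country": ["AAA"] * 3 + ["BBB"] * 2,
    "Quantile": [1, 2, 3, 1, 2],
})
expected_classes = 3  # would be 100 for the real dataset

# Number of distinct countries
n_countries = toy["Country"].nunique()

# Countries with fewer rows than expected
rows_per_country = toy.groupby("Country").size()
incomplete = rows_per_country[rows_per_country < expected_classes]

print(n_countries)
print(incomplete)
```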

### Matching ISO codes to country names

In [8]:
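# The file is semicolon-delimited; it is read here as a single column and split manually
code = pd.read_csv("/Users/teilo/Desktop/OC - Projet 7/predictions_revenus/Données/liste-197-etats-2020.csv",encoding='latin-1')
# Split into NOM, NOM_ALPHA, CODE and the remainder of the line
code = code['NOM;NOM_ALPHA;CODE;ARTICLE;NOM_LONG;CAPITALE'].str.split(";", n=3, expand=True)
# Keep only NOM_ALPHA (column 1) and CODE (column 2)
code = code.drop(columns=[0,3])
code = code.rename(columns={
    1:'Name',
    2:'Country'
})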

In [9]:
data = data.merge(code, on='Country', how='left')
#data = data.drop(columns='Name_x').rename(columns={'Name_y':'Name'})
data = data[['Name', 'Country', 'Year', 'GDP', 'Income', 'Quantile']]
data

Unnamed: 0,Name,Country,Year,GDP,Income,Quantile
0,Albanie,ALB,2008,7297,72889795,1
1,Albanie,ALB,2008,7297,91666235,2
2,Albanie,ALB,2008,7297,1010916,3
3,Albanie,ALB,2008,7297,10869078,4
4,Albanie,ALB,2008,7297,11326997,5
...,...,...,...,...,...,...
11594,Congo (République démocratique du),COD,2008,30319305,8106233,96
11595,Congo (République démocratique du),COD,2008,30319305,9117834,97
11596,Congo (République démocratique du),COD,2008,30319305,10578074,98
11597,Congo (République démocratique du),COD,2008,30319305,12866029,99


### Matching the data with population figures
We select the 2008 population data (2008 being the mean survey year) so that each country's share of the world population can be computed for a single year.

In [10]:
population = pd.read_csv("/Users/teilo/Desktop/OC - Projet 7/predictions_revenus/Données/FAOSTAT_data_population.csv")[['Zone', 'Valeur']]
population = population.rename(columns={'Zone':'Name'})
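# Repair names where "é" was read as "?"; regex=False treats "?" as a literal character
population['Name'] = population['Name'].str.replace("?","é", regex=False)
# Scale the values by 1000 to express them in persons
population['Valeur'] = population['Valeur']*1000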
population



Unnamed: 0,Name,Valeur
0,Afghanistan,27722276.0
1,Afrique du Sud,49779471.0
2,Albanie,3002678.0
3,Algérie,34730608.0
4,Allemagne,81065752.0
...,...,...
226,Venezuela (République bolivarienne du),27635832.0
227,Viet Nam,86243413.0
228,Yémen,21892146.0
229,Zambie,12848530.0


Removing the aggregate China row

In [11]:
pop_total = population['Valeur'].sum()
print(f"""World population in 2008: {pop_total:,}""")
display(population[population['Name'].str.contains('Chine')])

# Drop the aggregate 'Chine' row: it roughly equals the sum of the four other
# Chinese entries, so keeping it would count them twice
population = population[population.Name != 'Chine'].copy()
pop_total = population['Valeur'].sum()

print(f"""World population in 2008: {pop_total:,}""")

World population in 2008: 8,173,074,317.0


Unnamed: 0,Name,Valeur
40,Chine,1383986000.0
41,Chine - RAS de Hong-Kong,6881863.0
42,Chine - RAS de Macao,515239.0
43,"Chine, continentale",1353569000.0
44,"Chine, Taiwan Province de",23019040.0


World population in 2008: 6,789,088,686.0
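
The reason for dropping the 'Chine' row can be checked with simple arithmetic: its value is almost exactly the sum of the four other Chinese entries in the table above. The sketch below uses the values as displayed in that table, which appear rounded by the notebook, so a small gap is expected.

```python
# Values as displayed in the table above (rounded by the notebook display)
china_total = 1_383_986_000
parts = {
    "Chine - RAS de Hong-Kong": 6_881_863,
    "Chine - RAS de Macao": 515_239,
    "Chine, continentale": 1_353_569_000,
    "Chine, Taiwan Province de": 23_019_040,
}

# Compare the aggregate row with the sum of its components
parts_sum = sum(parts.values())
relative_gap = abs(china_total - parts_sum) / china_total

print(parts_sum)
print(relative_gap)
```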


### Computing each country's share of the world population

In [12]:
# Share (in %) of each country's population in the world population
population['%_pop_total'] = population['Valeur']/population['Valeur'].sum()*100
population

Unnamed: 0,Name,Valeur,%_pop_total
0,Afghanistan,27722276.0,0.408336
1,Afrique du Sud,49779471.0,0.733228
2,Albanie,3002678.0,0.044228
3,Algérie,34730608.0,0.511565
4,Allemagne,81065752.0,1.194059
...,...,...,...
226,Venezuela (République bolivarienne du),27635832.0,0.407062
227,Viet Nam,86243413.0,1.270324
228,Yémen,21892146.0,0.322461
229,Zambie,12848530.0,0.189253
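
Once each country has a share, the population covered by the analysis is the sum of the shares of the countries present in the income data. The sketch below illustrates this on made-up numbers; the `countries_in_dataset` list is hypothetical and would come from the income table in practice.

```python
import pandas as pd

# Toy population table (made-up values)
population = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Valeur": [50.0, 30.0, 15.0, 5.0],
})
population["%_pop_total"] = population["Valeur"] / population["Valeur"].sum() * 100

# Hypothetical list of countries present in the income dataset
countries_in_dataset = ["A", "C"]

# Sum the shares of the countries that are covered
is_covered = population["Name"].isin(countries_in_dataset)
covered = population.loc[is_covered, "%_pop_total"].sum()

print(covered)
```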


In [13]:
print(population.Name)

0                                 Afghanistan
1                              Afrique du Sud
2                                     Albanie
3                                     Algérie
4                                   Allemagne
                        ...                  
226    Venezuela (République bolivarienne du)
227                                  Viet Nam
228                                     Yémen
229                                    Zambie
230                                  Zimbabwe
Name: Name, Length: 230, dtype: object


In [14]:
population.loc[population['Name']== 'Congo']

Unnamed: 0,Name,Valeur,%_pop_total
48,Congo,4011486.0,0.059087


In [15]:
data = data.merge(population, on='Name', how='left')
data

Unnamed: 0,Name,Country,Year,GDP,Income,Quantile,Valeur,%_pop_total
0,Albanie,ALB,2008,7297,72889795,1,3002678.0,0.044228
1,Albanie,ALB,2008,7297,91666235,2,3002678.0,0.044228
2,Albanie,ALB,2008,7297,1010916,3,3002678.0,0.044228
3,Albanie,ALB,2008,7297,10869078,4,3002678.0,0.044228
4,Albanie,ALB,2008,7297,11326997,5,3002678.0,0.044228
...,...,...,...,...,...,...,...,...
11594,Congo (République démocratique du),COD,2008,30319305,8106233,96,,
11595,Congo (République démocratique du),COD,2008,30319305,9117834,97,,
11596,Congo (République démocratique du),COD,2008,30319305,10578074,98,,
11597,Congo (République démocratique du),COD,2008,30319305,12866029,99,,
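
The NaN values in the last rows show that some names did not match between the two sources. One way to list such names directly is to pass `indicator=True` to `merge`. This is a sketch on toy frames, not the notebook's own method; the names and values below are illustrative only.

```python
import pandas as pd

# Toy frames: the second name is spelled differently in each source
left = pd.DataFrame({"Name": ["Albanie", "Bolivie"], "Country": ["ALB", "BOL"]})
right = pd.DataFrame({
    "Name": ["Albanie", "Bolivie (autre libellé)"],  # made-up label
    "Valeur": [1.0, 2.0],                            # placeholder values
})

# indicator=True adds a '_merge' column telling where each row was found
merged = left.merge(right, on="Name", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()

print(unmatched)
```

Merging on a stable key such as the ISO code avoids this kind of spelling mismatch.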


In [16]:
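# numeric_only=True keeps the text columns out of the mean on recent pandas versions
data_na = data.groupby("Name").mean(numeric_only=True)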

In [17]:
null_data = data_na[data_na.isnull().any(axis=1)]
print(null_data)

                                      Year  Quantile  Valeur  %_pop_total
Name                                                                     
Azerbaïdjan                         2008.0      50.5     NaN          NaN
Biélorussie                         2008.0      50.5     NaN          NaN
Bolivie                             2008.0      50.5     NaN          NaN
Burkina                             2009.0      50.5     NaN          NaN
Centrafrique                        2008.0      50.5     NaN          NaN
Chine                               2007.0      50.5     NaN          NaN
Congo (République démocratique du)  2008.0      50.5     NaN          NaN
Corée du Sud                        2008.0      50.5     NaN          NaN
Côte d'Ivoire                       2008.0      50.5     NaN          NaN
Dominicaine (République)            2008.0      50.5     NaN          NaN
Grèce                               2008.0      50.5     NaN          NaN
Guatémala                           20

In [18]:
population1 = pd.read_csv("/Users/teilo/Desktop/OC - Projet 7/predictions_revenus/Données/population1.csv", on_bad_lines='skip')
population1

Unnamed: 0,Country Name;Country Code;2008;
0,Aruba;ABW;101362;
1,Africa Eastern and Southern;AFE;491173160;
2,Afghanistan;AFG;27722281;
3,Africa Western and Central;AFW;331772330;
4,Angola;AGO;21695636;
...,...
248,Samoa;WSM;183270;
249,Kosovo;XKX;1747383;
250,South Africa;ZAF;49779472;
251,Zambia;ZMB;12848531;


In [19]:
population1 = population1['Country Name;Country Code;2008;'].str.split(";", n=3, expand=True)


In [20]:
population1 = population1.rename(columns={0:"Name", 1:'Country', 2:'Population'})
population1

Unnamed: 0,Name,Country,Population,3
0,Aruba,ABW,101362,
1,Africa Eastern and Southern,AFE,491173160,
2,Afghanistan,AFG,27722281,
3,Africa Western and Central,AFW,331772330,
4,Angola,AGO,21695636,
...,...,...,...,...
248,Samoa,WSM,183270,
249,Kosovo,XKX,1747383,
250,South Africa,ZAF,49779472,
251,Zambia,ZMB,12848531,


In [21]:
population1 = population1.drop(columns=3)
population1

Unnamed: 0,Name,Country,Population
0,Aruba,ABW,101362
1,Africa Eastern and Southern,AFE,491173160
2,Afghanistan,AFG,27722281
3,Africa Western and Central,AFW,331772330
4,Angola,AGO,21695636
...,...,...,...
248,Samoa,WSM,183270
249,Kosovo,XKX,1747383
250,South Africa,ZAF,49779472
251,Zambia,ZMB,12848531


In [22]:
data4 = data3.merge(population1, how='left', on='Country')
data4

Unnamed: 0,Country,Year,Quantile,Income,GDP,Name,Population
0,ALB,2008,1,72889795,7297,Albania,2947314
1,ALB,2008,2,91666235,7297,Albania,2947314
2,ALB,2008,3,1010916,7297,Albania,2947314
3,ALB,2008,4,10869078,7297,Albania,2947314
4,ALB,2008,5,11326997,7297,Albania,2947314
...,...,...,...,...,...,...,...
11594,COD,2008,96,8106233,30319305,,
11595,COD,2008,97,9117834,30319305,,
11596,COD,2008,98,10578074,30319305,,
11597,COD,2008,99,12866029,30319305,,


In [23]:
data5 = data3.merge(population1, how='inner', on='Country')
data5

Unnamed: 0,Country,Year,Quantile,Income,GDP,Name,Population
0,ALB,2008,1,72889795,7297,Albania,2947314
1,ALB,2008,2,91666235,7297,Albania,2947314
2,ALB,2008,3,1010916,7297,Albania,2947314
3,ALB,2008,4,10869078,7297,Albania,2947314
4,ALB,2008,5,11326997,7297,Albania,2947314
...,...,...,...,...,...,...,...
10894,ZAF,2008,96,24553568,9602,South Africa,49779472
10895,ZAF,2008,97,28858031,9602,South Africa,49779472
10896,ZAF,2008,98,3575029,9602,South Africa,49779472
10897,ZAF,2008,99,46297316,9602,South Africa,49779472


In [24]:
data5.loc[data5['Country'] == 'COD']

Unnamed: 0,Country,Year,Quantile,Income,GDP,Name,Population


In [25]:
null_data = data4[data4.isnull().any(axis=1)]
null_data = null_data.sort_values('GDP')
print(null_data)

      Country  Year  Quantile     Income    GDP                Name Population
4697      IRN  2008        98  20007,959  10446                 NaN        NaN
4625      IRN  2008        26  2494,5828  10446                 NaN        NaN
4626      IRN  2008        27  2565,6492  10446                 NaN        NaN
4627      IRN  2008        28  2633,4148  10446                 NaN        NaN
4628      IRN  2008        29    2699,84  10446                 NaN        NaN
...       ...   ...       ...        ...    ...                 ...        ...
11294     PSE  2009        96  2763,8848    NaN  West Bank and Gaza    3591977
11295     PSE  2009        97  3077,8333    NaN  West Bank and Gaza    3591977
11296     PSE  2009        98  3449,2224    NaN  West Bank and Gaza    3591977
11297     PSE  2009        99   4165,997    NaN  West Bank and Gaza    3591977
11298     PSE  2009       100  6343,8755    NaN  West Bank and Gaza    3591977

[900 rows x 7 columns]
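
One plausible cause for the missing populations, not confirmed by the source, is the way `population1.csv` was read. The file is semicolon-delimited, but it was parsed with the default comma separator and `on_bad_lines='skip'`, so any line whose country name contains a comma would be silently dropped. The sketch below reproduces that effect on an in-memory sample; the name "Congo, Dem. Rep." is an assumption about how the file labels COD.

```python
import io
import pandas as pd

# In-memory sample mimicking the layout shown for population1.csv
raw = (
    "Country Name;Country Code;2008;\n"
    "Aruba;ABW;101362;\n"
    "Congo, Dem. Rep.;COD;60411195;\n"  # assumed label, contains a comma
)

# Default separator: the line with a comma has too many fields and is skipped
naive = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

# Semicolon separator: every line is parsed, keeping the first three columns
clean = pd.read_csv(io.StringIO(raw), sep=";", usecols=[0, 1, 2])

print(len(naive), len(clean))
print(clean)
```

Reading with `sep=';'` also makes the later manual `str.split` unnecessary.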


In [26]:
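# Attempt to fill in Name and Population for the six codes that found no match in population1.
# Caution: `|` combines two row selections and does not assign the hard-coded values,
# so the targeted rows end up entirely NaN, as the output below shows.
data4.loc[data4['Country'] == 'COD'] = data4.loc[data4['Name'] == 'Congo'] | data4.loc[data4['Population'] == 60411195]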
data4.loc[data4['Country'] == 'YEM'] = data4.loc[data4['Name'] == 'Yemen'] | data4.loc[data4['Population'] == 21892149]
data4.loc[data4['Country'] == 'VEN'] = data4.loc[data4['Name'] == 'Venezuela'] | data4.loc[data4['Population'] == 27635827]
data4.loc[data4['Country'] == 'KOR'] = data4.loc[data4['Name'] == 'Korea'] | data4.loc[data4['Population'] == 49054708]
data4.loc[data4['Country'] == 'IRN'] = data4.loc[data4['Name'] == 'Iran'] | data4.loc[data4['Population'] == 72120608]
data4.loc[data4['Country'] == 'EGY'] = data4.loc[data4['Name'] == 'Egypt'] | data4.loc[data4['Population'] == 79636081]
data4

Unnamed: 0,Country,Year,Quantile,Income,GDP,Name,Population
0,ALB,2008.0,1.0,72889795,7297,Albania,2947314
1,ALB,2008.0,2.0,91666235,7297,Albania,2947314
2,ALB,2008.0,3.0,1010916,7297,Albania,2947314
3,ALB,2008.0,4.0,10869078,7297,Albania,2947314
4,ALB,2008.0,5.0,11326997,7297,Albania,2947314
...,...,...,...,...,...,...,...
11594,,,,,,,
11595,,,,,,,
11596,,,,,,,
11597,,,,,,,
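
The all-NaN rows above indicate that the cell overwrote the targeted rows instead of filling them in. A sketch of what the cell was probably meant to do is given below: select the rows by country code and assign the two missing columns. It runs on a toy frame; the `missing` lookup is hypothetical and reuses the name and population hard-coded for COD in the cell above.

```python
import pandas as pd

# Toy stand-in for data4, with COD lacking Name and Population
data4 = pd.DataFrame({
    "Country": ["ALB", "COD", "COD"],
    "Quantile": [1, 1, 2],
    "Name": ["Albania", None, None],
    "Population": ["2947314", None, None],  # strings, as produced by str.split
})

# Hypothetical lookup: code -> (name, population)
missing = {"COD": ("Congo", "60411195")}

for iso_code, (name, pop) in missing.items():
    mask = data4["Country"] == iso_code
    # Assign only the two missing columns, leaving the other columns intact
    data4.loc[mask, "Name"] = name
    data4.loc[mask, "Population"] = pop

print(data4)
```

With this form the income rows for these countries are kept, so no `dropna` on 'Country' is needed afterwards.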


In [27]:
data4 = data4.dropna(subset=['Country'])


In [28]:
data4 = data4.set_index('Country')

In [29]:
data4 = data4.drop('TWN')

In [30]:
data4

Unnamed: 0_level_0,Year,Quantile,Income,GDP,Name,Population
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ALB,2008.0,1.0,72889795,7297,Albania,2947314
ALB,2008.0,2.0,91666235,7297,Albania,2947314
ALB,2008.0,3.0,1010916,7297,Albania,2947314
ALB,2008.0,4.0,10869078,7297,Albania,2947314
ALB,2008.0,5.0,11326997,7297,Albania,2947314
...,...,...,...,...,...,...
ZAF,2008.0,96.0,24553568,9602,South Africa,49779472
ZAF,2008.0,97.0,28858031,9602,South Africa,49779472
ZAF,2008.0,98.0,3575029,9602,South Africa,49779472
ZAF,2008.0,99.0,46297316,9602,South Africa,49779472


In [31]:
data4.loc[data4['Name'] == 'Congo']

Unnamed: 0_level_0,Year,Quantile,Income,GDP,Name,Population
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1


In [32]:
null_data = data4[data4.isnull().any(axis=1)]
null_data = null_data.sort_values('GDP')
print(null_data)

           Year Quantile     Income  GDP                Name Population
Country                                                                
XKX      2008.0      1.0   437,8937  NaN              Kosovo    1747383
XKX      2008.0      2.0  508,17133  NaN              Kosovo    1747383
XKX      2008.0      3.0   591,8282  NaN              Kosovo    1747383
XKX      2008.0      4.0        668  NaN              Kosovo    1747383
XKX      2008.0      5.0   730,4022  NaN              Kosovo    1747383
...         ...      ...        ...  ...                 ...        ...
PSE      2009.0     96.0  2763,8848  NaN  West Bank and Gaza    3591977
PSE      2009.0     97.0  3077,8333  NaN  West Bank and Gaza    3591977
PSE      2009.0     98.0  3449,2224  NaN  West Bank and Gaza    3591977
PSE      2009.0     99.0   4165,997  NaN  West Bank and Gaza    3591977
PSE      2009.0    100.0  6343,8755  NaN  West Bank and Gaza    3591977

[200 rows x 6 columns]
