# Introduction to PySpark DataFrames

This notebook is designed to help you get started with Apache Spark DataFrames in PySpark.

The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult) and ships with Databricks Runtime.

For more information on aggregate functions, see https://sparkbyexamples.com/pyspark/pyspark-aggregate-functions/

# Step 0. Load the dataset

Create a schema to assign column names and data types.

Then display the first rows of the data.

In [0]:
schema = """`age` DOUBLE,
`workclass` STRING,
`fnlwgt` DOUBLE,
`education` STRING,
`education_num` DOUBLE,
`marital_status` STRING,
`occupation` STRING,
`relationship` STRING,
`race` STRING,
`sex` STRING,
`capital_gain` DOUBLE,
`capital_loss` DOUBLE,
`hours_per_week` DOUBLE,
`native_country` STRING,
`income` STRING"""

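# Read the CSV with the explicit schema above (the file has no header row).
dataset = spark.read.csv("/databricks-datasets/adult/adult.data", schema=schema)

# Alternative: let Spark infer the column types. Note that header=True
# would treat the first data row as column names.
#dataset = spark.read.csv("/databricks-datasets/adult/adult.data", header=True, inferSchema=True)
dataset.show()          # print the first 20 rows
print(dataset.count())  # total number of rows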

In [0]:
display(dataset)

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K
49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K
52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K
31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K
42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K


# Visualize the data

In [0]:
display(dataset.select("hours_per_week").summary())

summary,hours_per_week
count,32561.0
mean,40.437455852093
stddev,12.347428681731838
min,1.0
25%,40.0
50%,40.0
75%,45.0
max,99.0


In [0]:
display(dataset.select("age").summary())


summary,age
count,32561.0
mean,38.58164675532078
stddev,13.640432553581356
min,17.0
25%,28.0
50%,37.0
75%,48.0
max,90.0


Perform an exploratory analysis of the data.

Sort the data by age.

In [0]:
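# Among people with education_num == 13, count rows per marital status.
# groupBy().count() returns one row per group, so the sort on age
# does not change these counts.
display(dataset.filter(dataset.education_num == 13)
    .sort("age")
    .groupBy("marital_status")
    .count())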

marital_status,count
Widowed,82
Married-spouse-absent,68
Married-AF-spouse,4
Married-civ-spouse,2768
Divorced,546
Never-married,1795
Separated,92


Show the number of people by marital_status.

Make a pie chart of the number of people by education.

Show a histogram of age by education level.

Show the age distribution of the population.

In [0]:
display(dataset
        .groupBy("education")
        .count()
        .sort("count", ascending=False))

education,count
HS-grad,10501
Some-college,7291
Bachelors,5355
Masters,1723
Assoc-voc,1382
11th,1175
Assoc-acdm,1067
10th,933
7th-8th,646
Prof-school,576


Check whether people with a higher level of education have a higher income.

In [0]:
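# Assumes trainDF is a subset of `dataset` (for example a training split)
# that already exists in the session.
display(trainDF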
        .groupBy("income")
        .count()
        .sort("count", ascending=False))

income,count
<=50K,19812
>50K,6264


In [0]:
# education_num 13.0 corresponds to Bachelors in the preview rows above.
display(trainDF.filter(trainDF.education_num == 13.0)
    .groupBy("income")
    .count())

income,count
>50K,1763
<=50K,2492


In [0]:
# education_num 14.0 corresponds to Masters in the preview rows above.
display(trainDF.filter(trainDF.education_num == 14.0)
    .groupBy("income")
    .count())

income,count
>50K,786
<=50K,602


### Your turn
Explore the following dataset, https://github.com/nytimes/covid-19-data,
and produce at least five relevant charts and three tables.

In [0]:
# Named covid_df because the cells below refer to the data by that name.
covid_df = spark.read.csv("/databricks-datasets/COVID/covid-19-data/us-counties.csv", header=True, inferSchema=True)
covid_df.show()
print(covid_df.count())

Sort the data in ascending order of date, and show the number of cases for Los Angeles.

In [0]:
# Keep Los Angeles County only, then sort by date in ascending order.
display(covid_df
        .filter(covid_df["county"] == "Los Angeles")
        .sort(covid_df["date"].asc()))

For a given date, show the number of cases.

Further reading: https://databricks.com/fr/discover/introduction-to-data-analysis-workshop-series/intro-apache-spark