BDLE 2022

Document date: 07/10/2022

# Labs 2 and 3: Windows


SQL with window functions

## Setup

Check that compute resources are allocated to your notebook, i.e. that it is connected (see the RAM and disk indicators at the top right). If not, click the Connect button to obtain resources.

To access the files stored on your Google Drive directly, enter the authentication code when prompted.

Adjust the name of your folder, for example MyDrive/ens/bdle/DM1

In [1]:
# import os
# from google.colab import drive
# drive.mount("/content/drive")

# drive_dir = "/content/drive/MyDrive/ens/bdle/TP1"
# os.makedirs(drive_dir, exist_ok=True)
# os.listdir(drive_dir)

Install pyspark and findspark:


In [2]:
!pip install -q pyspark
!pip install -q findspark



Start the Spark session

In [3]:
import os
import pyspark

# Point SPARK_HOME at the pip-installed pyspark package, wherever it lives,
# instead of hard-coding the Python version in the path
os.environ["SPARK_HOME"] = os.path.dirname(pyspark.__file__)
os.environ["JAVA_HOME"] = "/usr"

In [4]:
# Main imports
import findspark
from pyspark.sql import SparkSession 
from pyspark import SparkConf  

# for DataFrames and UDFs
# note: the wildcard import of pyspark.sql.functions shadows Python builtins such as max, min, sum and round
from pyspark.sql import *  
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import *

# for timing
import time

# initialise the environment variables for Spark
findspark.init()

# Start the Spark session 
# --------------------------
def demarrer_spark():
  local = "local[*]"
  appName = "TP"
  configLocale = SparkConf().setAppName(appName).setMaster(local).\
  set("spark.executor.memory", "6G").\
  set("spark.driver.memory","6G").\
  set("spark.sql.catalogImplementation","in-memory")
  
  spark = SparkSession.builder.config(conf = configLocale).getOrCreate()
  sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold","-1")

  # Tune the query execution environment to the size of the cluster (4 cores)
  spark.conf.set("spark.sql.shuffle.partitions","4")    
  print("session started, its id is ", sc.applicationId)
  return spark
spark = demarrer_spark()

session started, its id is  local-1665951806768


In [5]:
# use 8 partitions instead of the default 200
spark.conf.set("spark.sql.shuffle.partitions", "8")
print("Number of partitions used: ", spark.conf.get("spark.sql.shuffle.partitions"))

Number of partitions used:  8


In [6]:
# Optional:
# to access the Spark UI, see https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/
# !wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
# !unzip ngrok-stable-linux-amd64.zip
# get_ipython().system_raw('./ngrok http 4050 &')
# !curl -s http://localhost:4040/api/tunnels

Redefine the **display** function to show query results in a table

In [7]:
import pandas as pd
from google.colab import data_table

# alternatives to Databricks display function.

def display(df, n=100):
  return data_table.DataTable(df.limit(n).toPandas(), include_index=False, num_rows_per_page=10)

def display2(df, n=20):
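  # use the fully qualified option names: the short forms are ambiguous in recent pandas versions
  pd.set_option('display.max_columns', None)
  pd.set_option('display.max_colwidth', None)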
  return df.limit(n).toPandas()


Define the **%%sql** tag to write SQL queries more simply in a cell

In [8]:
from IPython.core.magic import (register_line_magic, register_cell_magic, register_line_cell_magic)

def removeComments(query):
  result = ""
  for line in query.split('\n'):
    if not(line.strip().startswith("--")):
      result += line + "\n"
  return result

@register_line_cell_magic
def sql(line, cell=None):
    "To run a sql query. Use:  %%sql"
    val = cell if cell is not None else line
    tabRequetes = removeComments(val).split(";")
    derniere = None
    est_requete = False
    for r in tabRequetes:
        r = r.strip()
        if len(r) > 2:
          derniere = spark.sql(r)
          est_requete = r.lower().startswith('select')
    if(est_requete):
      return display(derniere)
    else:
      return print('ok')
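
To make the behaviour of the magic easier to follow, here is a minimal sketch in plain Python of how a cell is cut into statements before each one is passed to `spark.sql`. The helper names `remove_comments` and `split_statements` are hypothetical and only mirror the logic of the cell above; no Spark session is needed.

```python
def remove_comments(query):
    # Drop every line whose first non-blank characters are "--"
    kept = [line for line in query.split("\n") if not line.strip().startswith("--")]
    return "\n".join(kept)

def split_statements(cell):
    # Split on ";" and keep only non-trivial statements, as the magic does
    parts = (s.strip() for s in remove_comments(cell).split(";"))
    return [s for s in parts if len(s) > 2]

cell = """
-- build the view
create or replace temp view v as select 1 as x;
select * from v
"""

statements = split_statements(cell)
for s in statements:
    print(repr(s))
```

One consequence of this simple splitting is that a semicolon inside a string literal would also cut the statement, and that only the result of the last statement is displayed, provided it starts with `select`.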

In [9]:
# optional (do not use)
# %load_ext google.colab.data_table
# %unload_ext google.colab.data_table

## Data access

### URL for accessing the datasets

In [10]:
# URL of the PUBLIC_DATASET folder containing the data files for the labs
# ---------------------------------------------------------------------------
# if downloading the datasets fails, go directly to the URL below
PUBLIC_DATASET_URL = "https://nuage.lip6.fr/s/H3bpyRGgnCq2NR4" 
PUBLIC_DATASET=PUBLIC_DATASET_URL + "/download?path="

print("URL of the folder containing the datasets ", PUBLIC_DATASET_URL)

URL of the folder containing the datasets  https://nuage.lip6.fr/s/H3bpyRGgnCq2NR4


### Mobility data

Data taken from the YFCC dataset

In [11]:
local_dir = "/local/data"
os.makedirs(local_dir, exist_ok=True)
os.listdir(local_dir)

[]

In [12]:
from urllib import request

# download the dataset if it has not already been downloaded
def download_file(web_dir, local_dir, file):
  local_file = local_dir + "/" + file
  web_file = web_dir + "/" + file
  if(os.path.isfile(local_file)):
    print(file, "is already stored")
  else:
    print("downloading from URL: ", web_file , "save in : " + local_file)
    request.urlretrieve(web_file , local_file)

# user visits
web_dir = PUBLIC_DATASET + "YFCC_POI_dataset_K_H_LIM/dataset_IJCAI_2015/data-ijcai15/userVisits-ijcai15"
download_file(web_dir, local_dir, "userVisits-Toro.csv")

# poi 
web_dir = PUBLIC_DATASET + "YFCC_POI_dataset_K_H_LIM/dataset_IJCAI_2015/data-ijcai15/poiList-ijcai15"
download_file(web_dir, local_dir, "POI-Toro.csv")



os.listdir(local_dir)

downloading from URL:  https://nuage.lip6.fr/s/H3bpyRGgnCq2NR4/download?path=YFCC_POI_dataset_K_H_LIM/dataset_IJCAI_2015/data-ijcai15/userVisits-ijcai15/userVisits-Toro.csv save in : /local/data/userVisits-Toro.csv
downloading from URL:  https://nuage.lip6.fr/s/H3bpyRGgnCq2NR4/download?path=YFCC_POI_dataset_K_H_LIM/dataset_IJCAI_2015/data-ijcai15/poiList-ijcai15/POI-Toro.csv save in : /local/data/POI-Toro.csv


['userVisits-Toro.csv', 'POI-Toro.csv']

### Visits

We consider a file describing users' check-ins and their movements. Each line corresponds to a photo taken by a user. The place and date of each photo are known. 
Consecutive photos taken during one day form a sequence whose identifier (seqID) is known. 

Read the first 2 lines of the CSV file in Python. 
Does the file have a header line?
Which character separates two consecutive values in a data line?

In [None]:
# the context manager closes the file once the two lines have been read
with open(local_dir + "/" + "userVisits-Toro.csv", "r") as f:
  print(f.readline()); print(f.readline())

"photoID";"userID";"dateTaken";"poiID";"poiTheme";"poiFreq";"seqID"

7941504100;"10007579@N00";1346844688;30;"Structure";1538;1



Read the visits file *without* specifying the attribute types.
By default, all attributes are treated as strings.

In [None]:
user_visits = spark.read.option("header", "True").option("delimiter", ";").format("csv").load(local_dir + "/" + "userVisits-Toro.csv")
user_visits.show(3)
user_visits.printSchema()

+----------+------------+----------+-----+---------+-------+-----+
|   photoID|      userID| dateTaken|poiID| poiTheme|poiFreq|seqID|
+----------+------------+----------+-----+---------+-------+-----+
|7941504100|10007579@N00|1346844688|   30|Structure|   1538|    1|
|4886005532|10012675@N05|1142731848|    6| Cultural|    986|    2|
|4886006468|10012675@N05|1142732248|    6| Cultural|    986|    2|
+----------+------------+----------+-----+---------+-------+-----+
only showing top 3 rows

root
 |-- photoID: string (nullable = true)
 |-- userID: string (nullable = true)
 |-- dateTaken: string (nullable = true)
 |-- poiID: string (nullable = true)
 |-- poiTheme: string (nullable = true)
 |-- poiFreq: string (nullable = true)
 |-- seqID: string (nullable = true)
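
The difference between untyped and typed reading can be sketched without Spark, using Python's `csv` module on the two lines printed earlier. This is only an illustration: the `converters` dictionary is a hypothetical stand-in for a schema declaration, not part of the Spark API.

```python
import csv
import io

# First two lines of userVisits-Toro.csv, as printed above
sample = (
    '"photoID";"userID";"dateTaken";"poiID";"poiTheme";"poiFreq";"seqID"\n'
    '7941504100;"10007579@N00";1346844688;30;"Structure";1538;1\n'
)

reader = csv.DictReader(io.StringIO(sample), delimiter=";")
row = next(reader)

# Without a schema, every value is a string
untyped = dict(row)

# Hypothetical converters playing the role of a schema (name -> type)
converters = {"photoID": int, "dateTaken": int, "poiID": int, "poiFreq": int, "seqID": int}
typed = {name: converters.get(name, str)(value) for name, value in row.items()}

print(type(untyped["poiFreq"]).__name__, type(typed["poiFreq"]).__name__)
```

With string values, comparisons and sorts are lexicographic, which is one reason to declare numeric types before ordering by a column such as poiFreq.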



Read the file while specifying the schema: attribute names and types

In [None]:
schema = "photoID long, userID String, date Long, poiID int, poiTheme String, poiFreq int, seqID int"

user_visits = spark.read.option("header", "True").option("delimiter", ";").csv(local_dir + "/" + "userVisits-Toro.csv", schema = schema)
user_visits.persist()
user_visits.createOrReplaceTempView("user_visits")
user_visits.show(4)
user_visits.printSchema()
display(user_visits)

+----------+------------+----------+-----+---------+-------+-----+
|   photoID|      userID|      date|poiID| poiTheme|poiFreq|seqID|
+----------+------------+----------+-----+---------+-------+-----+
|7941504100|10007579@N00|1346844688|   30|Structure|   1538|    1|
|4886005532|10012675@N05|1142731848|    6| Cultural|    986|    2|
|4886006468|10012675@N05|1142732248|    6| Cultural|    986|    2|
|4885404441|10012675@N05|1142732373|    6| Cultural|    986|    2|
+----------+------------+----------+-----+---------+-------+-----+
only showing top 4 rows

root
 |-- photoID: long (nullable = true)
 |-- userID: string (nullable = true)
 |-- date: long (nullable = true)
 |-- poiID: integer (nullable = true)
 |-- poiTheme: string (nullable = true)
 |-- poiFreq: integer (nullable = true)
 |-- seqID: integer (nullable = true)



Unnamed: 0,photoID,userID,date,poiID,poiTheme,poiFreq,seqID
0,7941504100,10007579@N00,1346844688,30,Structure,1538,1
1,4886005532,10012675@N05,1142731848,6,Cultural,986,2
2,4886006468,10012675@N05,1142732248,6,Cultural,986,2
3,4885404441,10012675@N05,1142732373,6,Cultural,986,2
4,4886008334,10012675@N05,1142732445,6,Cultural,986,2
...,...,...,...,...,...,...,...
95,2654929774,10014440@N06,1215593613,25,Shopping,1701,10
96,2654104231,10014440@N06,1215593634,25,Shopping,1701,10
97,2654930912,10014440@N06,1215593650,25,Shopping,1701,10
98,2654105249,10014440@N06,1215593655,25,Shopping,1701,10


Number of photos, number of sequences, and number of *a user visited a POI* events, each identified by a (sequence, POI) pair

In [None]:
%%sql
select count(*) as nbPhotos, count(distinct seqID) as nbSequences, count(distinct seqID, poiID) as nb_visites
from user_visits

Unnamed: 0,nbPhotos,nbSequences,nb_visites
0,39419,6057,7607


The POIs with the highest precomputed frequency *poiFreq*

In [None]:
%%sql
select distinct poiID, poiFreq 
from user_visits
order by poiFreq desc

Unnamed: 0,poiID,poiFreq
0,11,4142
1,22,3619
2,21,3594
3,16,3553
4,1,3506
5,4,3056
6,7,2064
7,23,1874
8,8,1736
9,25,1701


The POIs with the highest frequency (here the "frequency bis" is the number of photos taken at a POI)

In [None]:
%%sql
select poiID, max(poiFreq) as poiFreq, count(*) as poiFreqBis
from user_visits
group by poiID
order by poiFreqBis desc

Unnamed: 0,poiID,poiFreq,poiFreqBis
0,11,4142,4139
1,22,3619,3603
2,21,3594,3591
3,16,3553,3553
4,1,3506,3506
5,4,3056,3056
6,7,2064,2053
7,23,1874,1866
8,8,1736,1736
9,25,1701,1701


### Visited places: POI
They are called *Points Of Interest*

In [None]:
poi_schema = "poiID long, poiName String, latitude double, longitude double, theme String"

poi = spark.read.option("header", "True").option("delimiter", ";").csv(local_dir + "/" + "POI-Toro.csv", schema = poi_schema)
poi.show(3)
poi.printSchema()
poi.createOrReplaceTempView("POI")

+-----+------------------+--------+---------+-----+
|poiID|           poiName|latitude|longitude|theme|
+-----+------------------+--------+---------+-----+
|    1| Air_Canada_Centre|43.64333|-79.37917|Sport|
|    2|         BMO_Field|43.63278|-79.41861|Sport|
|    3|Maple_Leaf_Gardens|43.66222|-79.38028|Sport|
+-----+------------------+--------+---------+-----+
only showing top 3 rows

root
 |-- poiID: long (nullable = true)
 |-- poiName: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- theme: string (nullable = true)



In [None]:
%%sql
cache table POI;

SELECT * 
FROM POI

Unnamed: 0,poiID,poiName,latitude,longitude,theme
0,1,Air_Canada_Centre,43.64333,-79.37917,Sport
1,2,BMO_Field,43.63278,-79.41861,Sport
2,3,Maple_Leaf_Gardens,43.66222,-79.38028,Sport
3,4,Rogers_Centre,43.64139,-79.38917,Sport
4,5,Woodbine_Racetrack,43.712525,-79.602042,Sport
5,6,Art_Gallery_of_Ontario,43.65389,-79.39278,Cultural
6,7,Hockey_Hall_of_Fame,43.646976,-79.377253,Cultural
7,8,Ripley%27s_Aquarium_of_Canada,43.642481,-79.38605,Cultural
8,9,Ontario_Science_Centre,43.71667,-79.33833,Cultural
9,10,Riverdale_Farm,43.667111,-79.361294,Cultural


The list of themes

In [None]:
%%sql
create or replace temp view themes as
select distinct theme
from POI
order by theme;

cache table themes;

select *
from themes

Unnamed: 0,theme
0,Amusement
1,Beach
2,Cultural
3,Shopping
4,Sport
5,Structure


## Exercise 1

#### 1) Identify the themes

Define the table Theme1(theme_id, name) with theme numbers starting at 1, the theme names being sorted in ascending order.

In [None]:
%%sql
create or replace temp view Theme1 as

select  rank() over (order by theme) as theme_id , theme as name
from themes ;

select * from Theme1

Unnamed: 0,theme_id,name
0,1,Amusement
1,2,Beach
2,3,Cultural
3,4,Shopping
4,5,Sport
5,6,Structure
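
A side note on the choice of ranking function: `rank()` produces consecutive numbers here only because the theme names are distinct. The sketch below compares `rank()`, `dense_rank()` and `row_number()` on a hypothetical table `t` that contains a duplicate. It runs on SQLite through Python's `sqlite3` module rather than on Spark, an assumption made to keep it self-contained (SQLite 3.25 or later is needed for window functions).

```python
import sqlite3

# Hypothetical toy data: 'Beach' appears twice to create a tie
conn = sqlite3.connect(":memory:")
conn.execute("create table t(theme text)")
conn.executemany("insert into t values (?)",
                 [("Amusement",), ("Beach",), ("Beach",), ("Cultural",)])

rows = conn.execute("""
    select theme,
           rank()       over (order by theme) as rk,   -- ties share a rank, then a gap
           dense_rank() over (order by theme) as drk,  -- ties share a rank, no gap
           row_number() over (order by theme) as rn    -- always consecutive
    from t
    order by theme, rn
""").fetchall()

for row in rows:
    print(row)
```

When identifiers must be consecutive even in the presence of duplicates, `dense_rank()` (one id per distinct value) or `row_number()` (one id per row) is the safer choice.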


#### 2) Ranking sequences by their number of **distinct** POIs

Define the table TopSeq(seqID, nbPOI, rang) with rang=1 for the sequence with the largest nbPOI.


In [None]:
%%sql

create or replace temp view TopSeq as

select seqID, count(distinct poiID) as nbPOI, rank() over (order by count(distinct poiID) desc) as rang
from user_visits
group by seqID;


select * from TopSeq

Unnamed: 0,seqID,nbPOI,rang
0,298,13,1
1,4961,10,2
2,4351,9,3
3,5964,9,3
4,5369,8,5
...,...,...,...
95,5183,4,60
96,5539,4,60
97,813,4,60
98,2340,4,60


#### 2) Identify the check-ins

2a) Define the table Visite1 so that there are no duplicates on the triple (seqID, poiID, date)

In [None]:
%%sql
create or replace temp view Visite1 as
select seqID, poiID, date
from user_visits
Group by  seqID, poiID, date
;



select *
from Visite1
where seqID in (298, 510)
order by seqID, date, poiID;

Unnamed: 0,seqID,poiID,date
0,298,22,1371514467
1,298,7,1371514471
2,298,7,1371514472
3,298,23,1371516913
4,298,28,1371516914
5,298,28,1371516915
6,298,1,1371517315
7,298,29,1371517318
8,298,30,1371517318
9,298,30,1371517319


2b) Show that there are sequences in which the same date is associated with several distinct POIs. 
Hint: use a grouping query.

In [None]:
%%sql
select seqID, date , count(distinct(poiID)) as nb_POI_meme_date
from Visite1
group by seqID, date
having count(distinct(poiID)) >= 2
order by  nb_POI_meme_date desc, SeqID ; 


Unnamed: 0,seqID,date,nb_POI_meme_date
0,816,1291341600,3
1,822,1292896800,3
2,905,1317952800,3
3,1225,228664800,3
4,4271,1092646926,3
5,5497,1169172000,3
6,271,778341600,2
7,298,1371519708,2
8,298,1371519010,2
9,298,1371517318,2


2c) From the table Visite1, define the table Visite2(seqID, poiID, date, num), where *num* is the order number within a sequence.
POIs visited at the same date are sorted in ascending order of poiID.

In [None]:
%%sql

create or replace temp view Visite2 as
select seqID, poiID, date , rank() over (partition by seqID order by date, poiID) as num
from Visite1

;

select * from Visite2
where seqID in (298, 510)
order by seqID, num

Unnamed: 0,seqID,poiID,date,num
0,298,22,1371514467,1
1,298,7,1371514471,2
2,298,7,1371514472,3
3,298,23,1371516913,4
4,298,28,1371516914,5
5,298,28,1371516915,6
6,298,1,1371517315,7
7,298,29,1371517318,8
8,298,30,1371517318,9
9,298,30,1371517319,10


#### 3) Identify POI visits

Define the table Visite3(userID, seqID, poiID, poiPosition) such that poiPosition equals *i* for the i-th POI visited in a sequence.

Hints:

A visit to a POI corresponds to all the consecutive photos taken at that POI.

The same POI can appear **several** times at different positions in a sequence if at least one other POI was visited in between.


##### 3a) Previous POI
Start by associating each event with the previous POI in the sequence.
Hint: consider the lag() or first() functions.

In [None]:
%%sql
create or replace temp view Visite3a as

-- partition by seqID so that the first event of a sequence has no predecessor:
-- lag() then returns its default value -1, assumed not to be a valid poiID
select seqID, date, poiID, num, lag(poiID, 1, -1) over (partition by seqID order by num) as precedant
from Visite2;
 

select * from Visite3a
where seqID in (298, 510)
order by seqID, num;


Unnamed: 0,seqID,date,poiID,num,precedant
0,298,1371514467,22,1,-1
1,298,1371514471,7,2,22
2,298,1371514472,7,3,7
3,298,1371516913,23,4,7
4,298,1371516914,28,5,23
5,298,1371516915,28,6,28
6,298,1371517315,1,7,28
7,298,1371517318,29,8,1
8,298,1371517318,30,9,29
9,298,1371517319,30,10,30


##### 3b) Start of a visit
Add an attribute *debut* equal to 1 for the first tuple of a series of consecutive photos of the same POI, and 0 otherwise. 
Hint: consider the syntax case when ... then ... else ... end

In [None]:
%%sql
create or replace temp view Visite3b as

select seqID, date, poiID, num, precedant, 
CASE
    WHEN precedant = poiID THEN 0
    ELSE 1
END AS debut
from Visite3a;


select * from Visite3b
where seqID in (298, 510)
order by seqID, num;

Unnamed: 0,seqID,date,poiID,num,precedant,debut
0,298,1371514467,22,1,-1,1
1,298,1371514471,7,2,22,1
2,298,1371514472,7,3,7,0
3,298,1371516913,23,4,7,1
4,298,1371516914,28,5,23,1
5,298,1371516915,28,6,28,0
6,298,1371517315,1,7,28,1
7,298,1371517318,29,8,1,1
8,298,1371517318,30,9,29,1
9,298,1371517319,30,10,30,0


##### 3c) Order the visited POIs
Define the table Visite3 described at the beginning of question 3) by adding the attribute poiPosition.

In [None]:
%%sql
create or replace temp view Visite3 as

select seqID, date,poiID, num, sum(debut) over (partition by seqID order by num) as poiPosition
from Visite3b;


select * from Visite3
where seqID in (298, 510)
order by seqID, date, poiID;

Unnamed: 0,seqID,date,poiID,num,poiPosition
0,298,1371514467,22,1,1
1,298,1371514471,7,2,2
2,298,1371514472,7,3,2
3,298,1371516913,23,4,3
4,298,1371516914,28,5,4
5,298,1371516915,28,6,4
6,298,1371517315,1,7,5
7,298,1371517318,29,8,6
8,298,1371517318,30,9,7
9,298,1371517319,30,10,7
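
The three steps of question 3 (previous POI, start-of-visit flag, running sum of the flags) can be sketched for a single sequence in plain Python. The `events` list is hypothetical toy data, assumed to be already ordered by (date, poiID) as in Visite2.

```python
# Hypothetical events of one sequence: (date, poiID), ordered by date then poiID
events = [(100, 22), (104, 7), (105, 7), (200, 23), (201, 7)]

positions = []
previous_poi = None   # plays the role of lag(poiID): the first event has no predecessor
position = 0          # running sum of the start-of-visit flags
for date, poi in events:
    debut = 0 if poi == previous_poi else 1   # 1 when a new visit starts
    position += debut
    positions.append((date, poi, debut, position))
    previous_poi = poi

for p in positions:
    print(p)
```

In this toy data POI 7 is visited twice, at positions 2 and 4, because POI 23 is visited in between. A plain `group by poiID` would merge the two visits, which is why the running sum is needed.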


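#### 4) Duration of a POI visit
Define the table Visite4(seqID, poiPosition, poiID, duree), where duree is the difference between the largest and the smallest date of the consecutive photos associated with the same POI.

In [None]:
import builtins  # max/min are shadowed by "from pyspark.sql.functions import *"

def my_func(liste):
  # collect_list guarantees no order, so take the extremes explicitly
  return builtins.max(liste) - builtins.min(liste)

spark.udf.register("my_func", my_func, LongType())  # without a return type the UDF yields strings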


<function __main__.my_func(liste)>

In [None]:
%%sql

create or replace temp view Visite4 as

SELECT seqID, poiPosition,poiID, my_func(collect_list(date)) as duree
FROM Visite3
GROUP BY seqID, poiPosition,poiID;

select * from Visite4
where seqID in (298, 510)
order by seqID, poiPosition

Unnamed: 0,seqID,poiPosition,poiID,duree
0,298,1,22,0
1,298,2,7,1
2,298,3,23,0
3,298,4,28,1
4,298,5,1,0
5,298,6,29,0
6,298,7,30,2
7,298,8,8,0
8,298,9,29,0
9,298,10,6,0
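
Since the duration is defined from the largest and smallest dates, it should not depend on the order in which the dates of a group are collected. The sketch below, in plain Python on hypothetical rows, contrasts `max - min` with a "last minus first" computation on an unordered group.

```python
from collections import defaultdict

# Hypothetical rows (seqID, poiPosition, poiID, date), deliberately unordered
rows = [(1, 2, 7, 105), (1, 1, 22, 100), (1, 2, 7, 104), (1, 3, 23, 200), (1, 2, 7, 110)]

dates = defaultdict(list)
for seq, pos, poi, date in rows:
    dates[(seq, pos, poi)].append(date)

# Order-independent: difference between the extremes
durations = {key: max(ds) - min(ds) for key, ds in dates.items()}

# Order-dependent: only correct if each list happens to be sorted
last_minus_first = {key: ds[-1] - ds[0] for key, ds in dates.items()}

print(durations[(1, 2, 7)], last_minus_first[(1, 2, 7)])
```

For the visit at position 2 the two computations disagree, because the dates of that group were collected in the order 105, 104, 110.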


Check: average duration of a POI visit, for durations > 0

In [None]:
%%sql
select round(avg(duree)/60,1) as duree_en_minutes
from Visite4
where duree>0;

Unnamed: 0,duree_en_minutes
0,65.5


#### 4a) Average number of visits and average number of POIs in a sequence

In [None]:
%%sql
create or replace temp view SeqNbVisite as
select seqID, count(duree) as nbVisite, count( distinct poiID) nbPoi
from Visite4
group by seqID;

select avg(nbVisite), avg(nbPoi)
from SeqNbVisite;

Unnamed: 0,avg(nbVisite),avg(nbPoi)
0,1.304276,1.255902


Check: frequency of the POIs

In [None]:
%%sql
select poiID, count(*) as poiFreqVisite
from Visite4
group by poiID
order by poiFreqVisite desc

Unnamed: 0,poiID,poiFreqVisite
0,21,848
1,22,719
2,30,600
3,16,563
4,23,558
5,7,520
6,11,505
7,28,418
8,25,406
9,4,355


#### 4b) Number of sequences by number of visits 

Display the number of sequences for each existing number of visits. Also display, for each number of visits, the *cumulative* number of sequences with **at least** that number of visits.


In [None]:
%%sql
select nbVisite, count(seqID) as nbSequences, SUM( count(seqID) ) OVER(ORDER BY nbVisite desc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as  nbre_de_seq_avec_au_moins_un_visite
from SeqNbVisite
group by nbVisite
order by nbVisite desc;

Unnamed: 0,nbVisite,nbSequences,nbre_de_seq_avec_au_moins_un_visite
0,21,1,1
1,18,1,2
2,15,1,3
3,14,1,4
4,13,1,5
5,12,1,6
6,11,2,8
7,10,2,10
8,9,6,16
9,8,7,23
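
The "at least" count is a running sum taken over the number of visits in descending order. A minimal sketch in plain Python, on a hypothetical mapping from seqID to nbVisite:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical number of visits per sequence (seqID -> nbVisite)
nb_visite = {1: 1, 2: 1, 3: 2, 4: 3, 5: 1, 6: 2}

histogram = Counter(nb_visite.values())       # nbVisite -> number of sequences
levels = sorted(histogram, reverse=True)      # numbers of visits, largest first
counts = [histogram[n] for n in levels]
at_least = list(accumulate(counts))           # running sum in descending order

result = list(zip(levels, counts, at_least))
for line in result:
    print(line)
```

The last running total equals the total number of sequences, which gives a quick sanity check on the query result.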


Number of sequences by number of distinct POIs. 

Also display, for each number of distinct POIs, the cumulative number of sequences with **at least** that number of distinct POIs.

In [None]:
%%sql
select nbPoi, count(seqID) as nbSequences, SUM( count(seqID) ) OVER(ORDER BY nbPoi desc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as  nbre_de_poi_avec_au_moins_un_visite
from SeqNbVisite
group by nbPoi
order by nbPoi;

Unnamed: 0,nbPoi,nbSequences,nbre_de_poi_avec_au_moins_un_visite
0,1,5080,6057
1,2,642,977
2,3,216,335
3,4,60,119
4,5,33,59
5,6,9,26
6,7,9,17
7,8,4,8
8,9,2,4
9,10,1,2


Number of sequences with at least one POI visited at least twice

In [None]:
%%sql
create or replace temp view tmpseqID as
select seqID,poiID, count(*)
from Visite4
group by seqID, poiID
having count(*) > 1;

select count(distinct seqID)
from tmpseqID;

Unnamed: 0,count(DISTINCT seqID)
0,180


#### 5) Number of visits over a sliding week
a) Define the table Visite5a(userID, annee, mois, jour, nbVisite): nbVisite is the number of visits a user made each day. 



In [None]:
%%sql

create or replace temp view Visite5a as 
select userID, year(TIMESTAMP(date)) as year,
               month(TIMESTAMP(date)) as month,
               day(TIMESTAMP(date)) as day,
               count(*) as nbVisite
from user_visits
group by userID, year(TIMESTAMP(date)), month(TIMESTAMP(date)), day(TIMESTAMP(date));

select *
from Visite5a;


Unnamed: 0,userID,year,month,day,nbVisite
0,10012675@N05,2011,10,23,8
1,10014440@N06,2008,10,5,6
2,10116041@N02,2011,9,20,1
3,101330524@N02,2008,9,17,1
4,10282509@N00,2014,3,23,11
...,...,...,...,...,...
95,19761391@N06,2011,2,23,26
96,19761391@N06,2012,4,28,1
97,20532289@N00,2007,9,10,2
98,20634971@N00,2006,8,18,1


b) Derive the table Visite5b(userID, annee, mois, jour, nbVisite7jours): nbVisite7jours is the number of visits made over a sliding week.

In [None]:
%%sql

create or replace temp view Visite5b as 
select v.userID,v.year, v.month, v.day, sum(v.nbVisite), DAYOFWEEK(TIMESTAMP(u.date))
from Visite5a v , user_visits u
where v.userID == u.userID and DAYOFWEEK(TIMESTAMP(u.date)) >= 1 -- always true: dayofweek() ranges from 1 (Sunday) to 7 (Saturday)
group by v.userID, v.year, v.month, v.day, u.date
order by v.userID, v.year, v.month, v.day, u.date

;

select *
from Visite5b;

Unnamed: 0,userID,year,month,day,sum(nbVisite),dayofweek(date)
0,10007579@N00,2012,9,5,1,4
1,10012675@N05,2006,3,19,4,1
2,10012675@N05,2006,3,19,4,1
3,10012675@N05,2006,3,19,4,1
4,10012675@N05,2006,3,19,4,1
...,...,...,...,...,...,...
95,10014440@N06,2007,11,27,3,7
96,10014440@N06,2007,11,27,3,7
97,10014440@N06,2007,11,27,3,7
98,10014440@N06,2007,11,27,3,7


In [None]:
%%sql

create or replace temp view Visite5b as 
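select userID, year, month, day, nbVisite, sum(nbVisite) OVER (PARTITION BY userID
                                                         -- order by a day number so that the 7-day window can span month and year boundaries
                                                         ORDER BY datediff(make_date(year, month, day), date '1970-01-01')
                                                         range BETWEEN 6 PRECEDING AND CURRENT ROW) as nbVisite7jours
from Visite5a
order by userID, year, month, day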
;

select *
from Visite5b;

Unnamed: 0,userID,year,month,day,nbVisite,nbVisite7jours
0,10007579@N00,2012,9,5,1,1
1,10012675@N05,2006,3,19,4,4
2,10012675@N05,2006,3,21,1,5
3,10012675@N05,2011,10,22,1,1
4,10012675@N05,2011,10,23,8,9
...,...,...,...,...,...,...
95,10627620@N06,2010,6,22,1,1
96,10627620@N06,2010,11,5,1,1
97,10627620@N06,2011,11,8,151,151
98,10627620@N06,2011,12,3,27,27
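
The semantics of a sliding week can be sketched in plain Python on hypothetical daily counts for a single user. For each day, the window covers that day and the six calendar days before it, including when the window starts in the previous month.

```python
from datetime import date, timedelta

# Hypothetical daily counts for one user: (day, nbVisite)
daily = [
    (date(2011, 10, 28), 2),
    (date(2011, 10, 31), 1),
    (date(2011, 11, 2), 4),
    (date(2011, 11, 10), 3),
]

def rolling_7_days(daily):
    """For each day, sum the counts of that day and of the 6 preceding calendar days."""
    out = []
    for day, _ in daily:
        start = day - timedelta(days=6)
        total = sum(n for d, n in daily if start <= d <= day)
        out.append((day, total))
    return out

result = rolling_7_days(daily)
for day, total in result:
    print(day.isoformat(), total)
```

In this toy data, the window ending on 2 November includes 28 and 31 October. A window restricted to the current month would miss those two days.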


#### 6) Travel between two POIs
Define the table Duree_Deplacement(seqID, poiPosition, poiID, deplacement). 
*deplacement* is the time elapsed from the end of the visit to the current POI until the start of the visit to the next POI in the sequence.


In [None]:
%%sql
create or replace temp view Visite6 as

select seqID, poiPosition, poiID, date, sum(deplacement) over (partition by seqID,poiID,poiPosition order by deplacement desc) as deplacements
from (select seqID, poiPosition, poiID, date, 
      last_value(date) over (order by date rows between current row and 1 following)-date as deplacement from Visite3 order by date )
order by seqID, poiID
;



select * from Visite6
where seqID in (298, 510)
order by seqID, date, poiID;


Unnamed: 0,seqID,poiPosition,poiID,date,deplacements
0,298,1,22,1371514467,4
1,298,2,7,1371514471,2442
2,298,2,7,1371514472,2441
3,298,3,23,1371516913,1
4,298,4,28,1371516914,401
5,298,4,28,1371516915,400
6,298,5,1,1371517315,3
7,298,6,29,1371517318,0
8,298,7,30,1371517318,430
9,298,7,30,1371517319,430
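
As a point of comparison, the definition of *deplacement* can be sketched in plain Python, assuming each visit has first been reduced to its first and last date (the `visits` list below is hypothetical toy data for a single sequence). Each visit is paired with the next one, as a `lead()` window function partitioned by sequence would do.

```python
# Hypothetical visits of one sequence: (poiPosition, poiID, first_date, last_date)
visits = [(1, 22, 100, 100), (2, 7, 104, 105), (3, 23, 200, 230)]

deplacements = []
# Pair each visit with the following one; the last visit has no successor
for current, following in zip(visits, visits[1:] + [None]):
    pos, poi, _, end = current
    gap = following[2] - end if following is not None else None
    deplacements.append((pos, poi, gap))

for d in deplacements:
    print(d)
```

This yields one row per visit rather than one row per photo, and the last visit of a sequence has no travel time.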


## Exercise 2: YFCC

#### YFCC France data

In [15]:
import zipfile


# YFCC France
web_dir = PUBLIC_DATASET + "/YFCC_dataset_extrait"
download_file(web_dir, local_dir, "yfccFrance.zip")



#unzip

with zipfile.ZipFile(local_dir + "/yfccFrance.zip", 'r') as zip_ref:
  zip_ref.extractall(local_dir)


os.listdir(local_dir)


downloading from URL:  https://nuage.lip6.fr/s/H3bpyRGgnCq2NR4/download?path=/YFCC_dataset_extrait/yfccFrance.zip save in : /local/data/yfccFrance.zip


['userVisits-Toro.csv', 'yfccFrance.zip', 'yfccFrance', 'POI-Toro.csv']

In [16]:
yfcc_france = spark.read.format("parquet").load(local_dir + "/yfccFrance")
print(yfcc_france.count())
yfcc_france.printSchema()
display(yfcc_france)

2052004
root
 |-- Line: long (nullable = true)
 |-- PhotoID: long (nullable = true)
 |-- PhotoHash: string (nullable = true)
 |-- UserNSID: string (nullable = true)
 |-- UserNickname: string (nullable = true)
 |-- DateTaken: string (nullable = true)
 |-- DateUploaded: long (nullable = true)
 |-- CaptureDevice: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- UserTags: string (nullable = true)
 |-- MachineTags: string (nullable = true)
 |-- Longitude: float (nullable = true)
 |-- Latitude: float (nullable = true)
 |-- Accuracy: integer (nullable = true)
 |-- URL: string (nullable = true)
 |-- DownloadURL: string (nullable = true)
 |-- LicenseName: string (nullable = true)
 |-- LicenseURL: string (nullable = true)
 |-- ServerID: integer (nullable = true)
 |-- FarmID: integer (nullable = true)
 |-- Secret: string (nullable = true)
 |-- SecretOriginal: string (nullable = true)
 |-- Extension: string (nullable = true)
 |-- Marker:

Unnamed: 0,Line,PhotoID,PhotoHash,UserNSID,UserNickname,DateTaken,DateUploaded,CaptureDevice,Title,Description,...,URL,DownloadURL,LicenseName,LicenseURL,ServerID,FarmID,Secret,SecretOriginal,Extension,Marker
0,80096796,834796,dc9b7584ecb34a448540bee3b38fe85c,77922700@N00,iko,2004-09-15 19:41:25.0,1097582515,PENTAX+Corporation+PENTAX+Optio+S4,marseille,,...,http://www.flickr.com/photos/77922700@N00/834796/,http://farm1.staticflickr.com/1/834796_9e5d1ed...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,1,1,9e5d1edb3a,9e5d1edb3a,jpg,0
1,64666153,5477598,4c38199a877fba534860a365bf9c97,76384935@N00,Chip_2904,2005-02-26 20:36:16.0,1109450176,,Quiz+Night+2,,...,http://www.flickr.com/photos/76384935@N00/5477...,http://farm1.staticflickr.com/6/5477598_d4ec28...,Attribution-NonCommercial-ShareAlike License,http://creativecommons.org/licenses/by-nc-sa/2.0/,6,1,d4ec281653,d4ec281653,jpg,0
2,82397565,5975164,2a76868b7eb18ea240d48e2941648582,70408381@N00,scot2342,2004-07-31 10:12:52.0,1110087781,NIKON+E4200,Giverny,Giverny+flowers+Monet+France,...,http://www.flickr.com/photos/70408381@N00/5975...,http://farm1.staticflickr.com/3/5975164_5ebb92...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,3,1,5ebb925aa1,5ebb925aa1,jpg,0
3,39862899,8060056,987ba5ffd786949496eda3a5fcf89b2,32323502@N00,Julie70,2005-03-31 12:22:18.0,1112338974,SONY+DSC-P150,They+were+all+in+it,There+will+be+always+loving+couples+in+Paris%2...,...,http://www.flickr.com/photos/32323502@N00/8060...,http://farm1.staticflickr.com/4/8060056_df5ed9...,Attribution-NonCommercial-ShareAlike License,http://creativecommons.org/licenses/by-nc-sa/2.0/,4,1,df5ed9e19b,df5ed9e19b,jpg,0
4,64976931,8916795,b1e9a0bbfd8a221830461d39a6c7e1d3,51035823282@N01,alexdecarvalho,2005-04-08 17:56:02.0,1113079413,Canon+DIGITAL+IXUS+40,Grupo+Corpo,,...,http://www.flickr.com/photos/51035823282@N01/8...,http://farm1.staticflickr.com/8/8916795_11c551...,Attribution-NonCommercial-ShareAlike License,http://creativecommons.org/licenses/by-nc-sa/2.0/,8,1,11c551a24b,11c551a24b,jpg,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,67598393,190189307,e04ae5f6aaec837234c28ff81c20d27b,38996820@N00,clarque,2006-06-27 02:43:02.0,1152989336,OLYMPUS+CORPORATION+C765UZ,Hike+near+Pralognan,,...,http://www.flickr.com/photos/38996820@N00/1901...,http://farm1.staticflickr.com/76/190189307_6fd...,Attribution-NoDerivs License,http://creativecommons.org/licenses/by-nd/2.0/,76,1,6fd7baff97,6fd7baff97,jpg,0
96,94436396,190670498,f360dc5abc47711ce3d54953f52483,95285464@N00,leguan001,2003-04-30 08:18:40.0,1153049290,Canon+PowerShot+S40,Hohk%C3%B6nigsburg+%28Ch%C3%A2teau+du+Haut-K%C...,,...,http://www.flickr.com/photos/95285464@N00/1906...,http://farm1.staticflickr.com/71/190670498_6f5...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,71,1,6f51679d0a,6f51679d0a,jpg,0
97,93144642,191107677,a0ae43beae4f87af92905d402e77f768,65648243@N00,Laurent+Paris11,2006-07-14 19:26:20.0,1153086075,NIKON+CORPORATION+NIKON+D200,_DSC0011+-+Version+2,,...,http://www.flickr.com/photos/65648243@N00/1911...,http://farm1.staticflickr.com/65/191107677_229...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,65,1,229652f913,229652f913,jpg,0
98,82701618,193290427,7d16a5ab81a9cc605391cd1a7eeb8ceb,40286210@N00,Bryce+Edwards,2006-07-13 20:25:28.0,1153305586,SONY+DSC-F828,DSC09269,View+from+Terrace%2C+La+Cite+Radieuse%2C+Le+Co...,...,http://www.flickr.com/photos/40286210@N00/1932...,http://farm1.staticflickr.com/72/193290427_f63...,Attribution License,http://creativecommons.org/licenses/by/2.0/,72,1,f6332a248b,f6332a248b,jpg,0


In [22]:
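# the Parquet files carry their own schema, so this string is not applied when loading;
# it only lists the columns of interest, with the types printed above
schema = "PhotoID long, UserNSID String, DateTaken String, Longitude float, Latitude float"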

yfcc_france = spark.read.format("parquet").load(local_dir + "/yfccFrance")
yfcc_france.persist()
yfcc_france.createOrReplaceTempView("yfcc_france")
yfcc_france.show(4)
yfcc_france.printSchema()
display(yfcc_france)

+--------+-------+--------------------+------------+------------+--------------------+------------+--------------------+-------------------+--------------------+--------------------+-----------+---------+---------+--------+--------------------+--------------------+--------------------+--------------------+--------+------+----------+--------------+---------+------+
|    Line|PhotoID|           PhotoHash|    UserNSID|UserNickname|           DateTaken|DateUploaded|       CaptureDevice|              Title|         Description|            UserTags|MachineTags|Longitude| Latitude|Accuracy|                 URL|         DownloadURL|         LicenseName|          LicenseURL|ServerID|FarmID|    Secret|SecretOriginal|Extension|Marker|
+--------+-------+--------------------+------------+------------+--------------------+------------+--------------------+-------------------+--------------------+--------------------+-----------+---------+---------+--------+--------------------+--------------------+-

Unnamed: 0,Line,PhotoID,PhotoHash,UserNSID,UserNickname,DateTaken,DateUploaded,CaptureDevice,Title,Description,...,URL,DownloadURL,LicenseName,LicenseURL,ServerID,FarmID,Secret,SecretOriginal,Extension,Marker
0,80096796,834796,dc9b7584ecb34a448540bee3b38fe85c,77922700@N00,iko,2004-09-15 19:41:25.0,1097582515,PENTAX+Corporation+PENTAX+Optio+S4,marseille,,...,http://www.flickr.com/photos/77922700@N00/834796/,http://farm1.staticflickr.com/1/834796_9e5d1ed...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,1,1,9e5d1edb3a,9e5d1edb3a,jpg,0
1,64666153,5477598,4c38199a877fba534860a365bf9c97,76384935@N00,Chip_2904,2005-02-26 20:36:16.0,1109450176,,Quiz+Night+2,,...,http://www.flickr.com/photos/76384935@N00/5477...,http://farm1.staticflickr.com/6/5477598_d4ec28...,Attribution-NonCommercial-ShareAlike License,http://creativecommons.org/licenses/by-nc-sa/2.0/,6,1,d4ec281653,d4ec281653,jpg,0
2,82397565,5975164,2a76868b7eb18ea240d48e2941648582,70408381@N00,scot2342,2004-07-31 10:12:52.0,1110087781,NIKON+E4200,Giverny,Giverny+flowers+Monet+France,...,http://www.flickr.com/photos/70408381@N00/5975...,http://farm1.staticflickr.com/3/5975164_5ebb92...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,3,1,5ebb925aa1,5ebb925aa1,jpg,0
3,39862899,8060056,987ba5ffd786949496eda3a5fcf89b2,32323502@N00,Julie70,2005-03-31 12:22:18.0,1112338974,SONY+DSC-P150,They+were+all+in+it,There+will+be+always+loving+couples+in+Paris%2...,...,http://www.flickr.com/photos/32323502@N00/8060...,http://farm1.staticflickr.com/4/8060056_df5ed9...,Attribution-NonCommercial-ShareAlike License,http://creativecommons.org/licenses/by-nc-sa/2.0/,4,1,df5ed9e19b,df5ed9e19b,jpg,0
4,64976931,8916795,b1e9a0bbfd8a221830461d39a6c7e1d3,51035823282@N01,alexdecarvalho,2005-04-08 17:56:02.0,1113079413,Canon+DIGITAL+IXUS+40,Grupo+Corpo,,...,http://www.flickr.com/photos/51035823282@N01/8...,http://farm1.staticflickr.com/8/8916795_11c551...,Attribution-NonCommercial-ShareAlike License,http://creativecommons.org/licenses/by-nc-sa/2.0/,8,1,11c551a24b,11c551a24b,jpg,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,67598393,190189307,e04ae5f6aaec837234c28ff81c20d27b,38996820@N00,clarque,2006-06-27 02:43:02.0,1152989336,OLYMPUS+CORPORATION+C765UZ,Hike+near+Pralognan,,...,http://www.flickr.com/photos/38996820@N00/1901...,http://farm1.staticflickr.com/76/190189307_6fd...,Attribution-NoDerivs License,http://creativecommons.org/licenses/by-nd/2.0/,76,1,6fd7baff97,6fd7baff97,jpg,0
96,94436396,190670498,f360dc5abc47711ce3d54953f52483,95285464@N00,leguan001,2003-04-30 08:18:40.0,1153049290,Canon+PowerShot+S40,Hohk%C3%B6nigsburg+%28Ch%C3%A2teau+du+Haut-K%C...,,...,http://www.flickr.com/photos/95285464@N00/1906...,http://farm1.staticflickr.com/71/190670498_6f5...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,71,1,6f51679d0a,6f51679d0a,jpg,0
97,93144642,191107677,a0ae43beae4f87af92905d402e77f768,65648243@N00,Laurent+Paris11,2006-07-14 19:26:20.0,1153086075,NIKON+CORPORATION+NIKON+D200,_DSC0011+-+Version+2,,...,http://www.flickr.com/photos/65648243@N00/1911...,http://farm1.staticflickr.com/65/191107677_229...,Attribution-NonCommercial-NoDerivs License,http://creativecommons.org/licenses/by-nc-nd/2.0/,65,1,229652f913,229652f913,jpg,0
98,82701618,193290427,7d16a5ab81a9cc605391cd1a7eeb8ceb,40286210@N00,Bryce+Edwards,2006-07-13 20:25:28.0,1153305586,SONY+DSC-F828,DSC09269,View+from+Terrace%2C+La+Cite+Radieuse%2C+Le+Co...,...,http://www.flickr.com/photos/40286210@N00/1932...,http://farm1.staticflickr.com/72/193290427_f63...,Attribution License,http://creativecommons.org/licenses/by/2.0/,72,1,f6332a248b,f6332a248b,jpg,0


#### Question 1

Extract from YFCC the sequences of points visited by each user, such that the following conditions hold:

A sequence cannot span several days. If a user took photos on several consecutive days, these photos form several sequences.

A sequence must contain at least 3 distinct points.

Each point must be associated with at least 3 users.

Two photos taken at very close GPS positions (less than *d* meters apart) can be assumed to correspond to the same point.
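
The conditions above can be sketched in plain Python, assuming each photo has already been assigned a point identifier (for instance by merging positions less than *d* meters apart). The function name `extract_sequences`, the tuple layout and the sample data are hypothetical, and filtering the points before the sequences is only one possible reading of the statement.

```python
from collections import defaultdict
from datetime import datetime

def extract_sequences(photos, min_users=3, min_points=3):
    """photos: iterable of (user, taken_at, point_id) tuples (hypothetical layout).

    Returns {(user, day): [point_id, ...]} with points in chronological order.
    """
    photos = list(photos)

    # Keep only the points associated with at least `min_users` distinct users
    users_per_point = defaultdict(set)
    for user, _, point in photos:
        users_per_point[point].add(user)
    popular = {p for p, users in users_per_point.items() if len(users) >= min_users}

    # A sequence never spans several days: group by (user, calendar day)
    by_user_day = defaultdict(list)
    for user, taken_at, point in photos:
        if point in popular:
            by_user_day[(user, taken_at.date())].append((taken_at, point))

    # Keep the sequences that contain at least `min_points` distinct points
    sequences = {}
    for key, visits in by_user_day.items():
        visits.sort(key=lambda visit: visit[0])
        points = [point for _, point in visits]
        if len(set(points)) >= min_points:
            sequences[key] = points
    return sequences

# Made-up sample: u3 spreads its photos over two days, D is seen by one user only
photos = [
    ("u1", datetime(2006, 7, 14, 9), "A"), ("u1", datetime(2006, 7, 14, 10), "B"),
    ("u1", datetime(2006, 7, 14, 11), "C"), ("u1", datetime(2006, 7, 14, 12), "D"),
    ("u2", datetime(2006, 7, 14, 15), "C"), ("u2", datetime(2006, 7, 14, 13), "A"),
    ("u2", datetime(2006, 7, 14, 14), "B"),
    ("u3", datetime(2006, 7, 14, 9), "A"), ("u3", datetime(2006, 7, 15, 9), "B"),
    ("u3", datetime(2006, 7, 15, 10), "C"),
]
sequences = extract_sequences(photos)
```

In Spark SQL the same steps would probably map to a `count(distinct ...)` per point followed by a window partitioned by user and day, but that translation is left open here.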


In [None]:
%%sql
cache table yfcc_france;

ok


In [None]:
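from math import radians, sin, cos, sqrt, atan2

def distance(lat1, lat2, lon1, lon2):
  """Haversine distance in kilometers between two positions given in degrees."""
  R = 6373.0  # approximate Earth radius, in km
  # The trigonometric functions expect radians, not degrees
  lat1, lat2, lon1, lon2 = map(radians, (lat1, lat2, lon1, lon2))
  dlon = lon2 - lon1
  dlat = lat2 - lat1
  a = (sin(dlat/2))**2 + cos(lat1) * cos(lat2) * (sin(dlon/2))**2
  c = 2 * atan2(sqrt(a), sqrt(1-a))
  return R * c

# Declare the return type explicitly; a UDF registered without one returns strings
spark.udf.register("distance", distance, DoubleType())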

<function __main__.distance(lat1, lat2, lon1, lon2)>

In [None]:
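%%sql 
-- Distance between every pair of photos (cross join: quadratic in the number of photos)
create or replace temp view yfcc2 as 
select y1.PhotoID, y1.UserNSID, y1.Longitude, y1.Latitude, y1.DateTaken, y2.PhotoID as OtherPhotoID, distance(y1.Latitude, y2.Latitude, y1.Longitude, y2.Longitude) as distance
from yfcc_france y1 cross join yfcc_france y2 ;

cache table yfcc2;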
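
Joining `yfcc_france` with itself compares every photo with every other one, which grows quadratically with the table size. One possible alternative, sketched here in plain Python, snaps each position to a grid cell about *d* meters wide and treats photos in the same cell as the same point. The names `grid_cell` and `METERS_PER_DEGREE`, the reference latitude of 46° and the sample coordinates are assumptions made for illustration.

```python
from math import cos, floor, radians

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude

def grid_cell(lat, lon, d=100.0, ref_lat=46.0):
    """Return the (row, column) of the d-meter grid cell containing a position in degrees."""
    lat_step = d / METERS_PER_DEGREE
    # One degree of longitude shrinks with the cosine of the latitude
    lon_step = d / (METERS_PER_DEGREE * cos(radians(ref_lat)))
    return (floor(lat / lat_step), floor(lon / lon_step))

# Illustrative coordinates: two positions a few meters apart, and a distant one
near_a = grid_cell(48.8584, 2.2945)
near_b = grid_cell(48.8585, 2.2946)
far = grid_cell(43.2965, 5.3698)
```

This is only an approximation: two positions closer than *d* can still fall into adjacent cells, so the grid trades some accuracy for a grouping that needs no pairwise comparison.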

In [None]:
%%sql 
select case when 
from yfcc2
group by PhotoID, UserNSID ;

#### Question 2

Propose an example analysis of the sequences obtained in the previous question along 3 dimensions, at least one of which must have at least 3 levels.

Hints: the goal is to specify dimensions for analyzing users' movements, for example a temporal dimension (date by month, day, hour), a geographic dimension (position, place, neighborhood, arrondissement), a type-of-visit dimension (category classes and subclasses), etc.
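
As an illustration of a dimension with 3 levels, the temporal dimension can be derived from a photo timestamp and aggregated at any level. This is a minimal plain-Python sketch; the helper names `time_levels` and `roll_up` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

def time_levels(taken_at):
    """Map a timestamp to the three levels of a temporal dimension."""
    return {
        "month": taken_at.strftime("%Y-%m"),
        "day": taken_at.strftime("%Y-%m-%d"),
        "hour": taken_at.strftime("%Y-%m-%d %H"),
    }

def roll_up(timestamps, level):
    """Count the photos per value of the chosen level."""
    return Counter(time_levels(t)[level] for t in timestamps)

# Made-up timestamps in the same format as the DateTaken column
stamps = [
    datetime(2006, 7, 14, 19, 26, 20),
    datetime(2006, 7, 13, 20, 25, 28),
    datetime(2006, 7, 14, 9, 0, 0),
]
per_month = roll_up(stamps, "month")
per_day = roll_up(stamps, "day")
per_hour = roll_up(stamps, "hour")
```

Moving from `hour` to `day` to `month` corresponds to a roll-up along the hierarchy; a geographic or category dimension could be modeled the same way with its own levels.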