# Municipal Waste Study of Catalonia
## Data Cleaning

### Description
This series of notebooks documents a study of the municipal waste dataset for Catalonia, published by the Generalitat on its open data portal. The aim is to explore the data in order to draw interesting conclusions and identify possible applications.

This particular notebook cleans the data from the [Municipis Catalunya Geo](https://analisi.transparenciacatalunya.cat/Urbanisme-infraestructures/Municipis-Catalunya-Geo/9aju-tpwc) dataset so that it can later be enriched with geographic values and used for a municipality-level study of the territory of Catalonia.

### Authors
Joaquim Picó Mora, Marc Felip Pomes

In [3]:
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

### List of initial cleaning ideas
- Many nulls: they can be left as they are, since in the end we only want to visualize the data, or they can be set to 0
    - `Autocompostatge` (home composting)
    - `RAEE` = waste electrical and electronic equipment
    - `Ferralla` (scrap metal)
    - `Olis vegetals` (vegetable oils)
    - `Runes` (rubble)
    - `Residus Especials en petites quantitats` (special waste in small quantities)
- `Resta` (residual fraction)
    - Group (`Resta a Dipòsit` + `Resta a Incineració` + `Resta a Tractament Mecànic Biològic`) -> `Resta (sense desglossar)`
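
The two ideas above can be sketched on toy data. This is a minimal pandas illustration, not the notebook's actual cleaning step: the rows are hypothetical, only two of the sparse columns are shown, and reading the arrow as "sum the three breakdown columns into `Resta (sense desglossar)` whenever a breakdown exists" is an assumption.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the schema used in this notebook.
toy = pd.DataFrame({
    "Municipi": ["A", "B", "C"],
    "Autocompostatge": [None, 1.5, None],
    "RAEE": [2.0, None, None],
    "Resta a Dipòsit": [10.0, None, None],
    "Resta a Incineració": [None, 4.0, None],
    "Resta a Tractament Mecànic Biològic": [5.0, None, None],
    "Resta (sense desglossar)": [None, None, 7.0],
})

sparse_cols = ["Autocompostatge", "RAEE"]
resta_parts = [
    "Resta a Dipòsit",
    "Resta a Incineració",
    "Resta a Tractament Mecànic Biològic",
]

cleaned = toy.copy()

# Idea 1: treat a missing amount as 0 for the sparsely reported fractions.
cleaned[sparse_cols] = cleaned[sparse_cols].fillna(0)

# Idea 2 (assumed reading): sum the breakdown columns into a single total.
# min_count=1 keeps the sum missing when all three parts are missing, so the
# already aggregated value can be used as a fallback for those rows.
breakdown_total = cleaned[resta_parts].sum(axis=1, min_count=1)
cleaned["Resta (sense desglossar)"] = breakdown_total.fillna(
    cleaned["Resta (sense desglossar)"]
)
cleaned = cleaned.drop(columns=resta_parts)

print(cleaned)
```

On a Spark DataFrame the same two steps would typically use `DataFrame.fillna` with a `subset` and `pyspark.sql.functions.coalesce`.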

In [4]:
spark = (SparkSession
 .builder
 .appName("WasteCleaning")
 .getOrCreate())

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/22 07:59:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

In [10]:
waste_file = "./datasets/enriched/residus_municipi_geo.csv"
schema_df = "`Latitud` FLOAT,\
            `Longitud` FLOAT,\
            `geometry` STRING,\
            `Any` INT,\
            `Codi municipi` STRING,\
            `Municipi` STRING,\
            `Comarca` STRING,\
            `Població` FLOAT,\
            `Autocompostatge` FLOAT,\
            `Matèria orgànica` FLOAT,\
            `Poda i jardineria` FLOAT,\
            `Paper i cartró` FLOAT,\
            `Vidre` FLOAT,\
            `Envasos lleugers` FLOAT,\
            `Residus voluminosos + fusta` FLOAT,\
            `RAEE` FLOAT,\
            `Ferralla` FLOAT,\
            `Olis vegetals` FLOAT,\
            `Tèxtil` FLOAT,\
            `Runes` FLOAT,\
            `Residus Especials en petites quantitats (REPQ` FLOAT,\
            `Piles` FLOAT,\
            `Medicaments` FLOAT,\
            `Altres recollides selectives` FLOAT,\
            `R.S. / R.M. % total` FLOAT,\
            `Kg/hab/any recollida selectiva` FLOAT,\
            `Resta a Dipòsit` FLOAT,\
            `Resta a Incineració` FLOAT,\
            `Resta a Tractament Mecànic Biològic` FLOAT,\
            `Resta (sense desglossar)` FLOAT,\
            `Suma Fracció Resta` FLOAT,\
            `F.R. / R.M. %` FLOAT,\
            `Generació Residus Municipal Totals` FLOAT,\
            `Kg / hab / dia` FLOAT,\
            `Kg / hab / any` FLOAT"
df = spark.read.schema(schema_df).csv(waste_file)
df.show(n=5, truncate=False, vertical=True)

-RECORD 0-----------------------------------------------------------
 Latitud                                       | null               
 Longitud                                      | null               
 geometry                                      | geometry           
 Any                                           | null               
 Codi municipi                                 | Codi municipi      
 Municipi                                      | Municipi           
 Comarca                                       | Comarca            
 Població                                      | null               
 Autocompostatge                               | null               
 Matèria orgànica                              | null               
 Poda i jardineria                             | null               
 Paper i cartró                                | null               
 Vidre                                         | null               
 Envasos lleugers                 

In [9]:
df.summary()

                                                                                

summary,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19,_c20,_c21,_c22,_c23,_c24,_c25,_c26,_c27,_c28,_c29,_c30,_c31,_c32,_c33,_c34,_c35,_c36
count,18846.0,18847,18847,18847,18847,18847,18847,18847,18847,7543,18847,18847,18847,18847,18847,18847,9427,7543,7543,16963,8485,8485,18847,18847,18847,18847,18847,18847,15079,15079,15079,3769,18847,18847,18847,18847,18847
mean,9422.5,1.7496833571139938,41.73578833831022,,2009.5022285896212,211798.45250981642,,,1904.9704907672733,53.56642800318218,300.34759651915516,81.12913859705067,215.68673580600665,138.27554186564734,95.28503422476906,175.58867680144283,2685.885635476342,750.1617608061522,108.14691063378415,660.3448296191486,10390.008486562941,220.48385195662422,52.102302875941845,56.36750504085747,8362.930913721744,1192.735494799956,30.258513742969377,161.67141568502507,97053.67654861388,33991.35667860459,78548.66308528982,170417.73460721868,2357.920361402951,69.73618009126594,3938.540783667622,1.5279847182425972,556.9933277087964
stddev,5440.515922226494,0.7871043760314783,0.4208746583175954,,5.766949162934999,125124.55998437676,,,51902.136106834216,153.17549767633474,3050.487050485707,427.58679566569936,2045.4861739441208,1012.910711992243,605.1514557776107,1243.176738442318,18173.304620951923,2356.1515489364706,648.7590503621035,5976.136836946031,36927.65352504999,740.8188342523684,368.3525927992337,515.3350724672584,41817.65767669423,8613.75957444988,18.477830416746738,115.38615552112319,583300.1748149367,446108.45620370976,1020738.131496537,1209494.5125364142,17522.625114822207,18.483498128530172,27716.925682915153,0.6536049738985722,238.05703473308648
min,0.0,0.25057195,40.5428638,POINT (0.25057195...,2000,170010.0,Abella de la Conca,Alt Camp,1.0,0.0,0.0,0.0,0.0,0.0,-0.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4710.0,1.064357054,41.43006742,,2004.0,82397.0,,,4.069,0.0,0.0,0.0,7.65,8.61,3.63,0.15,36.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,33.11,15.48,80.38,2014.0,0.0,0.0,6193.0,92.97,58.68,158.14,1.15,418.4
50%,9420.0,1.736184762,41.71041342,,2010.0,171766.0,,,104.0,0.0,9.14,0.0,22.25,22.55,11.88,7.33,189.0,27.0,3.0,0.0,113.0,9.0,2.0,5.0,17.0,109.83,27.48,139.15,11269.0,0.0,0.0,22128.0,266.0,72.49,446.7,1.37,498.07
75%,14132.0,2.344137229,42.06615809,,2015.0,252075.0,,,346.0,48.0,117.44,16.26,95.32,78.14,49.82,54.64,1142.0,364.0,41.0,48.0,3367.0,116.0,16.0,22.0,628.0,531.91,41.31,215.29,43727.0,0.0,0.0,79094.0,1018.05,84.51,1857.43,1.69,616.33
max,9999.0,Longitud,Latitud,geometry,Any,Codi municipi,Òrrius,Vallès Oriental,Població,Autocompostatge,Matèria orgànica,Poda i jardineria,Paper i cartró,Vidre,Envasos lleugers,Residus voluminos...,RAEE,Ferralla,Olis vegetals,Tèxtil,Runes,Residus Especials...,Piles,Medicaments,Altres recollides...,Total Recollida S...,R.S. / R.M. % total,Kg/hab/any recoll...,Resta a Dipòsit,Resta a Incineració,Resta a Tractamen...,Resta (sense desg...,Suma Fracció Resta,F.R. / R.M. %,Generació Residus...,Kg / hab / dia,Kg / hab / any


We want to see why there are so many missing values for certain waste types.

In [48]:
print("("+"Any"+","+str(len(df.select(f.col("Any")).groupBy("Any").count().collect()))+")")

(Any,21)


In [51]:
columns = ["Autocompostatge", "RAEE", "Ferralla", "Olis vegetals", "Runes", "Residus Especials en petites quantitats (REPQ"]
for column in columns:
    print("("+column+","+str(len(df.select(f.col("Any"), f.col(column)).where(f.col(column) != None).groupBy("Any").count().collect()))+")")

(Autocompostatge,0)
(RAEE,0)
(Ferralla,0)
(Olis vegetals,0)
(Runes,0)
(Residus Especials en petites quantitats (REPQ,0)


In [49]:
print("("+"Municipi"+","+str(len(df.select(f.col("Municipi")).groupBy("Municipi").count().collect()))+")")

(Municipi,951)


In [50]:
columns = ["Autocompostatge", "RAEE", "Ferralla", "Olis vegetals", "Runes", "Residus Especials en petites quantitats (REPQ"]
for column in columns:
    print("("+column+","+str(len(df.select(f.col("Municipi"), f.col(column)).where(f.col(column) != None).groupBy("Municipi").count().collect()))+")")

(Autocompostatge,0)
(RAEE,0)
(Ferralla,0)
(Olis vegetals,0)
(Runes,0)
(Residus Especials en petites quantitats (REPQ,0)
