In this file you can write whatever you consider appropriate. We recommend describing your solution in detail, along with every assumption you are making. Here you can run the functions you defined in the other files of the src folder and measure time, memory, etc.

In [1]:
# Path to the data file inside the container directory
file_path = "/home/jovyan/work/data/raw/farmers-protest-tweets-2021-2-4.json"

### SparkSession builder and file read

In [9]:
from pyspark.sql import SparkSession

In [10]:
spark = SparkSession.builder.appName("FarmersProtestTweets").getOrCreate()

In [11]:
df = spark.read.json(file_path)

### DataFrame schema and first rows

In [12]:
df.printSchema()

root
 |-- content: string (nullable = true)
 |-- conversationId: long (nullable = true)
 |-- date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
 |-- likeCount: long (nullable = true)
 |-- media: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- duration: double (nullable = true)
 |    |    |-- fullUrl: string (nullable = true)
 |    |    |-- previewUrl: string (nullable = true)
 |    |    |-- thumbnailUrl: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- variants: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- bitrate: long (nullable = true)
 |    |    |    |    |-- contentType: string (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |-- mentionedUsers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created: string (nullable = true)
 |    |   

In [15]:
df.show(5)

+--------------------+-------------------+--------------------+-------------------+----+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+----------+------------+--------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+
|             content|     conversationId|                date|                 id|lang|likeCount|               media|      mentionedUsers|            outlinks|quoteCount|         quotedTweet|     renderedContent|replyCount|retweetCount|retweetedTweet|              source|        sourceLabel|           sourceUrl|         tcooutlinks|                 url|                user|
+--------------------+-------------------+--------------------+-------------------+----+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+----------+------------+----

#### Creating Temp View for Data Exploration

In [23]:
df.createOrReplaceTempView("farmers_protest")

### Exploring columns to be used in data curation

In [25]:
result = spark.sql("""
    SELECT
        content,
        mentionedUsers
    FROM
        farmers_protest
""")

result.show()

+--------------------+--------------------+
|             content|      mentionedUsers|
+--------------------+--------------------+
|The world progres...|[{NULL, NULL, NUL...|
|#FarmersProtest \...|[{NULL, NULL, NUL...|
|ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾ...|                NULL|
|@ReallySwara @roh...|[{NULL, NULL, NUL...|
|#KisanEktaMorcha ...|                NULL|
|Jai jwaan jai kis...|                NULL|
|     #FarmersProtest|                NULL|
|#ModiDontSellFarm...|                NULL|
|@mandeeppunia1 wa...|[{NULL, NULL, NUL...|
|#FarmersProtest h...|                NULL|
|கோதுமைப் பயிர்களை...|                NULL|
|@mandeeppunia1 wa...|[{NULL, NULL, NUL...|
|Another farmer, M...|                NULL|
|Jai kissan #Farme...|                NULL|
|#FarmersProtest h...|                NULL|
|ਸਰਕਾਰੇ ਨੀ ਤੇਰੇ ਕੰ...|                NULL|
|@akshaykumar Hi c...|[{NULL, NULL, NUL...|
|#ModiDontSellFarm...|                NULL|
|@taapsee watch fu...|[{NULL, NULL, NUL...|
|#FarmersProtest h...|          

#### Understanding the challenge requirements

q1. The top 10 dates with the most tweets. For each of those days, report the user (username) with the most posts.

**Columns: id, date, user.username**
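
As a rough illustration of the q1 logic, the sketch below works on plain Python tuples rather than on the Spark DataFrame. The function name `top_dates_with_top_user` and the sample rows are hypothetical, and it assumes `date` is an ISO 8601 string whose first 10 characters are the calendar day. Ties are resolved by first appearance, which may not be the desired rule.

```python
from collections import Counter, defaultdict


def top_dates_with_top_user(rows, n=10):
    """rows: iterable of (date_string, username) pairs.

    Returns a list of (day, top_username) for the n days with the most tweets.
    """
    tweets_per_day = Counter()
    users_per_day = defaultdict(Counter)
    for date_string, username in rows:
        # Assumption: ISO 8601 strings, so the first 10 characters are YYYY-MM-DD.
        day = date_string[:10]
        tweets_per_day[day] += 1
        users_per_day[day][username] += 1
    return [
        (day, users_per_day[day].most_common(1)[0][0])
        for day, _ in tweets_per_day.most_common(n)
    ]


# Made-up sample rows, not taken from the dataset.
sample_rows = [
    ("2021-02-24T09:23:35+00:00", "user_a"),
    ("2021-02-24T09:22:10+00:00", "user_a"),
    ("2021-02-24T08:00:00+00:00", "user_b"),
    ("2021-02-23T10:00:00+00:00", "user_b"),
]
print(top_dates_with_top_user(sample_rows, n=2))
```

On the real data the same two-level grouping (first by day, then by day and username) would presumably be expressed with Spark aggregations instead of in-memory counters.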


q2. The top 10 most used emojis with their respective counts.

**Columns: id, content**
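
One possible way to approach q2 without extra dependencies is a regular expression over a few Unicode blocks, sketched below. This is only an approximation: the ranges in `EMOJI_RE` are an assumption, they include some non-emoji symbols, and each code point is counted on its own, so skin-tone modifiers and joined sequences are split apart. A dedicated emoji library would likely be more accurate. The function name `count_emojis` and the sample texts are hypothetical.

```python
import re
from collections import Counter

# Rough approximation: a few Unicode blocks where most emojis live.
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (used for flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related blocks
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]"
)


def count_emojis(texts, n=10):
    """Return the n most frequent emoji code points across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(EMOJI_RE.findall(text))
    return counts.most_common(n)


# Made-up sample texts, not taken from the dataset.
sample_texts = [
    "Jai kisan \U0001F33E\U0001F33E",
    "support \u270A \U0001F33E",
    "no emoji here",
]
print(count_emojis(sample_texts, n=2))
```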


q3. The all-time top 10 most influential users (username), ranked by the number of mentions (@) each of them receives.

The "mentionedUsers" column at the main tweet level appears to be filled with null values, so the mentioned users have to be extracted from the tweet content instead.

**Columns: id, content, user.username**
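
Extracting mentions from `content` could look roughly like the sketch below. The pattern assumes handles are 1 to 15 letters, digits, or underscores and that an `@` preceded by a word character (as in an email address) is not a mention. Handles are lowercased before counting, on the assumption that they are case-insensitive. The names `MENTION_RE`, `extract_mentions`, and `top_mentioned`, and the sample texts, are hypothetical.

```python
import re
from collections import Counter

# Assumption: a handle is 1-15 characters from [A-Za-z0-9_], not glued to other word characters.
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")


def extract_mentions(text):
    """Return the handles mentioned in a tweet text, without the leading @."""
    return MENTION_RE.findall(text)


def top_mentioned(texts, n=10):
    """Return the n most mentioned handles (lowercased) with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(name.lower() for name in extract_mentions(text))
    return counts.most_common(n)


# Made-up sample texts, not taken from the dataset.
sample_texts = [
    "@user_a @user_b watch this #FarmersProtest",
    "thanks @User_A for the support",
    "mail me at someone@example.com",
]
print(top_mentioned(sample_texts, n=2))
```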

In [26]:
result = spark.sql("""
    SELECT
        id,
        date,
        content,
        user.username
        
    FROM
        farmers_protest
""")

result.show()

+-------------------+--------------------+--------------------+---------------+
|                 id|                date|             content|       username|
+-------------------+--------------------+--------------------+---------------+
|1364506249291784198|2021-02-24T09:23:...|The world progres...|ArjunSinghPanam|
|1364506237451313155|2021-02-24T09:23:...|#FarmersProtest \...|     PrdeepNain|
|1364506195453767680|2021-02-24T09:23:...|ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾ...| parmarmaninder|
|1364506167226032128|2021-02-24T09:23:...|@ReallySwara @roh...|  anmoldhaliwal|
|1364506144002088963|2021-02-24T09:23:...|#KisanEktaMorcha ...|     KotiaPreet|
|1364506120497360896|2021-02-24T09:23:...|Jai jwaan jai kis...|      babli_708|
|1364506076272496640|2021-02-24T09:22:...|     #FarmersProtest|Varinde17354019|
|1364505995859423234|2021-02-24T09:22:...|#ModiDontSellFarm...|    BitnamSingh|
|1364505991887347714|2021-02-24T09:22:...|@mandeeppunia1 wa...|  anmoldhaliwal|
|1364505896576053248|2021-02-24T09:22:..

In [35]:
result.printSchema()

root
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- content: string (nullable = true)
 |-- username: string (nullable = true)



##### Checking null values in columns

In [27]:
from pyspark.sql import functions as F

In [28]:
# Count nulls per column; the F alias avoids shadowing Python's built-in sum
null_counts = result.select([F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in result.columns])

null_counts.show()

+---+----+-------+--------+
| id|date|content|username|
+---+----+-------+--------+
|  0|   0|      0|       0|
+---+----+-------+--------+



None of the four selected columns (id, date, content, username) contains null values.

##### Checking duplicated data by id

In [33]:
duplicate_count = result.groupBy("id").count()
duplications = duplicate_count.filter(duplicate_count["count"] > 1)

In [34]:
duplications.show()

+---+-----+
| id|count|
+---+-----+
+---+-----+



No id appears more than once, so the dataset contains no duplicate tweets by id.