In this file you can write whatever you see fit. We recommend detailing your solution and all the assumptions you are making. Here you can run the functions you defined in the other files in the src folder, measure time, memory, etc.

In [1]:
file_path = "/home/jovyan/work/data/raw/farmers-protest-tweets-2021-2-4.json"


### SparkSession builder and file read

In [9]:
from pyspark.sql import SparkSession

In [10]:
spark = SparkSession.builder.appName("FarmersProtestTweets").getOrCreate()

In [11]:
df = spark.read.json(file_path)

### Dataframe print schema and head

In [12]:
df.printSchema()

root
 |-- content: string (nullable = true)
 |-- conversationId: long (nullable = true)
 |-- date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
 |-- likeCount: long (nullable = true)
 |-- media: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- duration: double (nullable = true)
 |    |    |-- fullUrl: string (nullable = true)
 |    |    |-- previewUrl: string (nullable = true)
 |    |    |-- thumbnailUrl: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- variants: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- bitrate: long (nullable = true)
 |    |    |    |    |-- contentType: string (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |-- mentionedUsers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created: string (nullable = true)
 |    |   

In [15]:
df.show(5)

+--------------------+-------------------+--------------------+-------------------+----+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+----------+------------+--------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+
|             content|     conversationId|                date|                 id|lang|likeCount|               media|      mentionedUsers|            outlinks|quoteCount|         quotedTweet|     renderedContent|replyCount|retweetCount|retweetedTweet|              source|        sourceLabel|           sourceUrl|         tcooutlinks|                 url|                user|
+--------------------+-------------------+--------------------+-------------------+----+---------+--------------------+--------------------+--------------------+----------+--------------------+--------------------+----------+------------+----

#### Creating Temp View for Data Exploration

In [23]:
df.createOrReplaceTempView("farmers_protest")

### Exploring columns to be used in data curation

In [25]:
result = spark.sql("""
    SELECT
        content,
        mentionedUsers
    FROM
        farmers_protest
""")

result.show()

+--------------------+--------------------+
|             content|      mentionedUsers|
+--------------------+--------------------+
|The world progres...|[{NULL, NULL, NUL...|
|#FarmersProtest \...|[{NULL, NULL, NUL...|
|ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾ...|                NULL|
|@ReallySwara @roh...|[{NULL, NULL, NUL...|
|#KisanEktaMorcha ...|                NULL|
|Jai jwaan jai kis...|                NULL|
|     #FarmersProtest|                NULL|
|#ModiDontSellFarm...|                NULL|
|@mandeeppunia1 wa...|[{NULL, NULL, NUL...|
|#FarmersProtest h...|                NULL|
|கோதுமைப் பயிர்களை...|                NULL|
|@mandeeppunia1 wa...|[{NULL, NULL, NUL...|
|Another farmer, M...|                NULL|
|Jai kissan #Farme...|                NULL|
|#FarmersProtest h...|                NULL|
|ਸਰਕਾਰੇ ਨੀ ਤੇਰੇ ਕੰ...|                NULL|
|@akshaykumar Hi c...|[{NULL, NULL, NUL...|
|#ModiDontSellFarm...|                NULL|
|@taapsee watch fu...|[{NULL, NULL, NUL...|
|#FarmersProtest h...|          

#### Understanding the needs of the challenge

q1. The top 10 dates with the most tweets. For each of those days, give the user (username) with the most posts.

**Columns: id, date, user.username**
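The Q1 requirement can be sketched in plain Python to make the two-level aggregation explicit. This is only an illustration on made-up rows, not the Spark implementation in `q1_memory` or `q1_time`; the function name `top_dates_with_top_user` and the sample usernames are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def top_dates_with_top_user(
    rows: Iterable[Tuple[date, str]], n: int = 10
) -> List[Tuple[date, str]]:
    """Return the n dates with the most tweets, each paired with its most active user.

    rows is an iterable of (tweet_date, username) pairs (hypothetical input shape).
    """
    tweets_per_date: Counter = Counter()
    users_per_date: Dict[date, Counter] = defaultdict(Counter)
    for tweet_date, username in rows:
        tweets_per_date[tweet_date] += 1           # tweets per day
        users_per_date[tweet_date][username] += 1  # tweets per user within that day
    return [
        (day, users_per_date[day].most_common(1)[0][0])
        for day, _ in tweets_per_date.most_common(n)
    ]


# Made-up sample rows, for illustration only
sample = [
    (date(2021, 2, 12), "alice"),
    (date(2021, 2, 12), "alice"),
    (date(2021, 2, 12), "bob"),
    (date(2021, 2, 13), "bob"),
]
q1_sketch = top_dates_with_top_user(sample)
print(q1_sketch)
```

The return shape mirrors the `List[Tuple[datetime.date, str]]` annotation used by the Q1 functions.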


q2. The top 10 most used emojis with their respective counts.

**Columns: id, content**
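A rough sketch of how emojis could be counted from `content`, assuming a simple code-point range match is good enough. The ranges below are an approximation chosen for illustration: they miss flags and split multi-code-point sequences (skin tones, ZWJ sequences) into separate symbols. The function name `top_emojis` and the sample texts are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Approximate emoji ranges (assumption): main pictograph blocks plus
# miscellaneous symbols and dingbats. Not a complete emoji definition.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def top_emojis(texts: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count every matched emoji code point across all texts and return the top n."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(EMOJI_PATTERN.findall(text))
    return counts.most_common(n)


# Made-up sample texts, for illustration only
sample_texts = [
    "Jai kisan 🚜🚜 #FarmersProtest 🙏",
    "🙏🙏 support farmers",
    "no emoji here",
]
q2_sketch = top_emojis(sample_texts)
print(q2_sketch)
```

For production use, a dedicated emoji library would likely handle sequences and newer emoji more reliably than hand-written ranges.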


q3. The all-time top 10 most influential users (username), based on the count of mentions (@) each of them receives.

The "mentionedUsers" field at the main tweet level appears to be filled with null values, so the mentioned users have to be extracted from the "content" text instead.

**Columns: id, content, user.username**
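Extracting mentions from `content` could look like the following sketch. It assumes handles consist of up to 15 letters, digits, or underscores and that they should be compared case-insensitively; both are assumptions made for this illustration, as are the function name `top_mentioned` and the sample texts.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# "@" not preceded by a word character (this skips e-mail addresses),
# followed by 1 to 15 handle characters and then a non-handle character.
MENTION_PATTERN = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")


def top_mentioned(texts: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count @mentions across all texts and return the n most mentioned handles."""
    counts: Counter = Counter()
    for text in texts:
        # Lowercase so that @Taapsee and @taapsee count as the same user (assumption)
        counts.update(handle.lower() for handle in MENTION_PATTERN.findall(text))
    return counts.most_common(n)


# Made-up sample texts, for illustration only
sample_texts = [
    "@mandeeppunia1 was arrested",
    "Thanks @mandeeppunia1 and @Taapsee",
    "mail me at user@example.com",
]
q3_sketch = top_mentioned(sample_texts)
print(q3_sketch)
```

The same pattern could be applied in Spark through a regex-based column function, but that step is not shown here.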

In [26]:
result = spark.sql("""
    SELECT
        id,
        date,
        content,
        user.username
        
    FROM
        farmers_protest
""")

result.show()

+-------------------+--------------------+--------------------+---------------+
|                 id|                date|             content|       username|
+-------------------+--------------------+--------------------+---------------+
|1364506249291784198|2021-02-24T09:23:...|The world progres...|ArjunSinghPanam|
|1364506237451313155|2021-02-24T09:23:...|#FarmersProtest \...|     PrdeepNain|
|1364506195453767680|2021-02-24T09:23:...|ਪੈਟਰੋਲ ਦੀਆਂ ਕੀਮਤਾ...| parmarmaninder|
|1364506167226032128|2021-02-24T09:23:...|@ReallySwara @roh...|  anmoldhaliwal|
|1364506144002088963|2021-02-24T09:23:...|#KisanEktaMorcha ...|     KotiaPreet|
|1364506120497360896|2021-02-24T09:23:...|Jai jwaan jai kis...|      babli_708|
|1364506076272496640|2021-02-24T09:22:...|     #FarmersProtest|Varinde17354019|
|1364505995859423234|2021-02-24T09:22:...|#ModiDontSellFarm...|    BitnamSingh|
|1364505991887347714|2021-02-24T09:22:...|@mandeeppunia1 wa...|  anmoldhaliwal|
|1364505896576053248|2021-02-24T09:22:..

In [35]:
result.printSchema()

root
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- content: string (nullable = true)
 |-- username: string (nullable = true)



##### Checking null values in columns

In [27]:
from pyspark.sql import functions as F

In [28]:
# Count nulls per column; using the F namespace avoids shadowing Python's built-in sum
null_counts = result.select([F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in result.columns])

null_counts.show()

+---+----+-------+--------+
| id|date|content|username|
+---+----+-------+--------+
|  0|   0|      0|       0|
+---+----+-------+--------+



There are no null values in the selected columns.

##### Checking duplicated data by id

In [33]:
duplicate_count = result.groupBy("id").count()
duplications = duplicate_count.filter(duplicate_count["count"] > 1)

In [34]:
duplications.show()

+---+-----+
| id|count|
+---+-----+
+---+-----+



There are no duplicate ids in the dataset.

# Challenge Solution

For each question, I measure both memory usage and execution time.

Memory_profiler: I use the memory_profiler library to profile memory consumption line by line at each stage of the data processing. This shows which operations are memory-intensive and could be optimized.
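For readers without memory_profiler installed, the idea of measuring the memory cost of a call can be sketched with the standard library's `tracemalloc`. This is a stand-in, not the tool used in this notebook: it reports peak Python allocations for the whole call rather than per-line process memory, and the helper names `peak_memory_mib` and `build_list` are hypothetical.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def peak_memory_mib(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, peak Python-allocated memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


def build_list(n: int) -> list:
    """Toy workload standing in for a data processing function."""
    return list(range(n))


data, peak_mib = peak_memory_mib(build_list, 100_000)
print("Peak: {:.2f} MiB".format(peak_mib))
```

Like any Python-side measurement, this only sees allocations made by the Python interpreter itself.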

Time: I measure execution time with datetime differences. In the Jupyter notebook, I record the time immediately before and after each function call and compute the difference. This helps me assess performance and identify steps that could benefit from time optimization.
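The datetime-difference approach can be wrapped in a small helper so that every measurement is taken the same way. This is a minimal sketch; the helper name `timed` is hypothetical and the summed range is just a toy workload.

```python
from datetime import datetime, timedelta
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, timedelta]:
    """Run func and return (result, elapsed wall-clock time)."""
    start_time = datetime.now()
    result = func(*args, **kwargs)
    end_time = datetime.now()
    return result, end_time - start_time


# Toy workload standing in for a call such as q1_time(STAGING_DATA_PATH)
total, duration = timed(sum, range(1_000_000))
print('Duration: {}'.format(duration))
```

For very short measurements, `time.perf_counter()` is a monotonic alternative that is not affected by system clock adjustments.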

##### General observations


The volume of your data should guide how much memory you allocate to your SparkSession:

```
from pyspark.sql import SparkSession

# Replace the placeholders with sizes suited to your data volume, e.g. "4g"
spark = SparkSession.builder \
    .appName("FarmersProtestTweets") \
    .config("spark.executor.memory", "MEMORY ALLOCATED") \
    .config("spark.driver.memory", "MEMORY ALLOCATED") \
    .getOrCreate()
```

In this case, the transformed dataset is only 27 MB, so I let the SparkSession use its default memory allocation. For larger datasets, you can choose to persist your data on disk instead. This may slightly increase processing time but reduces the demand on driver memory.

In [1]:
# Datetime to analyze execution time
from datetime import datetime

In [2]:
STAGING_DATA_PATH = "../data/staging/farmers-protest-tweets-staging.csv"

In [15]:
from pyspark.sql import SparkSession

In [26]:
sparkSession_default = SparkSession.builder\
    .appName("FarmersProtestTweets")\
    .getOrCreate()


23/11/02 16:20:33 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [28]:
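# If a session is already running, getOrCreate() reuses it and only runtime
# SQL configurations take effect (see the warning above)
sparkSession_optimization = SparkSession.builder \
    .appName("FarmersProtestTweetsOptimization") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()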

In [27]:
sparkSession_default.stop()

## Q1 Memory

In [3]:
from q1_memory import q1_memory

In [4]:
start_time = datetime.now()
result = q1_memory(STAGING_DATA_PATH)
end_time = datetime.now()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/11/02 15:51:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

Filename: /Users/rtakeshi/Documents/Projetos/challenge-DE/src/q1_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    17     88.9 MiB     88.9 MiB           1   @profile
    18                                         def q1_memory(file_path: str) -> List[Tuple[datetime.date, str]]:
    19                                             
    20     90.6 MiB      1.7 MiB           1       spark = SparkSession.builder.appName("FarmersProtestTweets").getOrCreate()
    21     90.6 MiB      0.0 MiB           1       df = spark.read.option('delimiter', '~').option('header', True).option('multiline', True).schema(STAGING_SCHEMA).csv(file_path)
    22                                             #Top 10 dates with more content
    23     90.6 MiB      0.0 MiB           1       date_counts = df.groupBy('date').agg(count('content').alias('date_count'))
    24     90.6 MiB      0.0 MiB           1       date_counts = date_counts.orderBy(col('date_count').desc()).limit(10)
    25

In [7]:
print('Duration: {}'.format(end_time - start_time))

Duration: 0:00:21.395270


### Analysis

In this solution, memory usage remains stable throughout the data transformations required to solve the challenge.

With PySpark's default configuration, memory appears to be well managed by my Spark instance.

The main increase in memory usage in my method occurs at data collection. PySpark evaluates lazily, so the transformations defined earlier are only executed when an action collects the results. Note that, by default, memory_profiler tracks only the Python driver process, so work done inside the Spark JVM is not reflected in these figures.

## Q1 Time

In [8]:
from q1_time import q1_time

In [9]:
start_time = datetime.now()
result = q1_time(STAGING_DATA_PATH)
end_time = datetime.now()


                                                                                

Filename: /Users/rtakeshi/Documents/Projetos/challenge-DE/src/q1_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    18     91.5 MiB     91.5 MiB           1   @profile
    19                                         def q1_time(file_path: str) -> List[Tuple[datetime.date, str]]:
    20                                                 
    21     91.5 MiB      0.0 MiB           2       spark = SparkSession.builder.appName("FarmersProtestTweets").config("spark.executor.memory", "3g") \
    22     91.5 MiB      0.0 MiB           1       .config("spark.driver.memory", "3g").getOrCreate()
    23                                         
    24                                         
    25     91.5 MiB      0.0 MiB           1       df = spark.read.option('delimiter', '~').option('header', True).option('multiline', True).schema(STAGING_SCHEMA).csv(file_path)
    26                                         
    27                                             #Top 10 dates

In [10]:
print('Duration: {}'.format(end_time - start_time))

Duration: 0:00:04.734006


### Q1 Conclusions

In [11]:
LARGER_DATA_DIR = "../data/test/test_volume_data.csv"

In [12]:
start_time = datetime.now()
result = q1_memory(LARGER_DATA_DIR)
end_time = datetime.now()

                                                                                

Filename: /Users/rtakeshi/Documents/Projetos/challenge-DE/src/q1_memory.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    17     91.5 MiB     91.5 MiB           1   @profile
    18                                         def q1_memory(file_path: str) -> List[Tuple[datetime.date, str]]:
    19                                             
    20     91.6 MiB      0.0 MiB           1       spark = SparkSession.builder.appName("FarmersProtestTweets").getOrCreate()
    21     91.6 MiB      0.0 MiB           1       df = spark.read.option('delimiter', '~').option('header', True).option('multiline', True).schema(STAGING_SCHEMA).csv(file_path)
    22                                             #Top 10 dates with more content
    23     91.6 MiB      0.0 MiB           1       date_counts = df.groupBy('date').agg(count('content').alias('date_count'))
    24     91.6 MiB      0.0 MiB           1       date_counts = date_counts.orderBy(col('date_count').desc()).limit(10)
    25

In [13]:
print('Duration: {}'.format(end_time - start_time))

Duration: 0:00:46.826039


In [14]:
start_time = datetime.now()
result = q1_time(LARGER_DATA_DIR)
end_time = datetime.now()



                                                                                

Filename: /Users/rtakeshi/Documents/Projetos/challenge-DE/src/q1_time.py

Line #    Mem usage    Increment  Occurrences   Line Contents
    18     91.6 MiB     91.6 MiB           1   @profile
    19                                         def q1_time(file_path: str) -> List[Tuple[datetime.date, str]]:
    20                                                 
    21     91.7 MiB      0.0 MiB           2       spark = SparkSession.builder.appName("FarmersProtestTweets").config("spark.executor.memory", "3g") \
    22     91.7 MiB      0.1 MiB           1       .config("spark.driver.memory", "3g").getOrCreate()
    23                                         
    24                                         
    25     91.7 MiB      0.0 MiB           1       df = spark.read.option('delimiter', '~').option('header', True).option('multiline', True).schema(STAGING_SCHEMA).csv(file_path)
    26                                         
    27                                             #Top 10 dates

In [None]:
print('Duration: {}'.format(end_time - start_time))