In [1]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType, LongType
from pyspark.sql.functions import *
from operator import add

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
69,application_1671409217564_0091,pyspark3,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
filePath = 'abfs://dda-2022-12-15t21-04-18-212z@ddasta.dfs.core.windows.net/reddit/2020/'
# One JSON file per month of 2020 (RC_2020-01.json ... RC_2020-12.json)
months = ['RC_2020-{:02d}.json'.format(m) for m in range(1, 13)]
jsonFiles = [filePath + month for month in months]

schema = StructType([
        StructField('subreddit', StringType(), nullable=True),
        StructField('author', StringType(), nullable=True),
        StructField('author_premium', BooleanType(), nullable=True),
        StructField('score', LongType(), nullable=True)
        ])

# spark.read.json accepts a list of paths, so all twelve files load in one call
df = spark.read.json(jsonFiles, schema=schema)

df.printSchema()

root
 |-- subreddit: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_premium: boolean (nullable = true)
 |-- score: long (nullable = true)

In [3]:
premiumAuthors = df.filter(df.author_premium == True)
nonPremiumAuthors = df.filter(df.author_premium != True)

# Self-Directed Analysis 1: Premium vs Non-Premium Users

## Q) Are premium users more active than users who are not premium?

### Before Analysis Assumption

Yes. These are users who pay for an essentially free service. A reasonable guess is that they are so active on Reddit that they consider it well worth the money to be able to give and receive awards and, overall, to feel that they "stand out" on Reddit.

## Q) Do premium users get more upvotes on their comments than non-premium users?

### Before Analysis Assumption

Honestly, not really sure. Maybe they are on premium because they "know" how to play the Reddit "game", which essentially means gathering as many upvotes as possible. At the same time, on an anonymous social media site like Reddit, people might not care whether another user is a "premium" user.
In fact, knowing Reddit, people might just downvote them because they are premium users!

Another question, and I am not really sure how we can answer it accurately, is whether Reddit promotes these premium users more.
A better way of doing this could be to compare the average comment score of a premium user with that of a non-premium user active in the same subreddit. But this can be tricky, since context matters a lot, and this dataset does not explicitly provide it.
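
The within-subreddit comparison described above can be sketched in plain Python on a few made-up rows. Everything here is hypothetical (the `toy_rows` values and the helper name `mean_score_by_subreddit_and_premium` are not from the dataset); it only illustrates grouping by subreddit and premium flag before averaging.

```python
from collections import defaultdict

# Hypothetical toy rows with the same four fields as the notebook's schema:
# (subreddit, author, author_premium, score). The values are made up.
toy_rows = [
    ("AskReddit", "a1", True, 10),
    ("AskReddit", "a2", False, 4),
    ("AskReddit", "a3", False, 6),
    ("politics", "a1", True, 2),
    ("politics", "a4", True, 4),
    ("politics", "a5", False, 9),
]


def mean_score_by_subreddit_and_premium(rows):
    """Return {(subreddit, is_premium): mean score} for the given rows."""
    totals = defaultdict(lambda: [0, 0])  # key -> [score sum, comment count]
    for subreddit, _author, is_premium, score in rows:
        totals[(subreddit, is_premium)][0] += score
        totals[(subreddit, is_premium)][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


per_group = mean_score_by_subreddit_and_premium(toy_rows)
for (subreddit, is_premium), mean_score in sorted(per_group.items()):
    print(subreddit, "premium" if is_premium else "non-premium", mean_score)
```

On the real DataFrame, the same idea would presumably be a `df.groupBy("subreddit", "author_premium")` followed by `agg(avg("score"))`; that has not been run here.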

## Q) Which subreddit has the highest average count of premium users?

### Before Analysis Assumption

Again, I don't really know. It might be a popular subreddit such as AskReddit, but I'm not sure.


# Are premium users more active on Reddit than non-premium users?





* 2020 data only

In [4]:
# Comments per premium author; count() already names the new column "count"
averageNumberOfPostsPremium = premiumAuthors.groupBy(premiumAuthors.author)\
                                .count()

In [5]:
type(averageNumberOfPostsPremium)

<class 'pyspark.sql.dataframe.DataFrame'>

In [6]:
averageNumberOfPostsPremium = averageNumberOfPostsPremium.agg(avg(col("count")))

In [7]:
averageNumberOfPostsPremium.show()

+------------------+
|        avg(count)|
+------------------+
|205.88356448637913|
+------------------+

In [9]:
# Comments per non-premium author; count() already names the new column "count"
averageNumberOfPostsNormal = nonPremiumAuthors.groupBy(nonPremiumAuthors.author)\
                                .count()

averageNumberOfPostsNormal = averageNumberOfPostsNormal.agg(avg(col("count")))

In [10]:
averageNumberOfPostsNormal.show()

+------------------+
|        avg(count)|
+------------------+
|61.885745011191894|
+------------------+

## Finding

As seen from the above averages, premium users are much more active on Reddit than non-premium users: about 3.3 times as active (roughly 205.9 vs 61.9 comments per author).

However, the comparison might be a bit skewed, since there are a lot of non-premium users with only 1 or 2 comments, and that pulls the non-premium average down.

On the other hand, as mentioned earlier, you probably only take the premium membership if you are an active member of Reddit (or have cash to burn, in which case lemme know!)
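
To see why a long tail of low-activity accounts matters, here is a small sketch comparing the mean and the median of some made-up per-author comment counts. The numbers in `toy_counts` are hypothetical and not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical per-author comment counts (made-up numbers): many authors
# with one or two comments and a few very active ones.
toy_counts = [1, 1, 1, 2, 2, 3, 5, 40, 300]

toy_mean = mean(toy_counts)
toy_median = median(toy_counts)

print("mean:  ", toy_mean)
print("median:", toy_median)
```

The mean lands far above the median because a couple of heavy commenters dominate the sum. Reporting a median next to each average would be one way to check how much the skew affects the comparison; in Spark that could likely be approximated with the `percentile_approx` SQL function, which was not tried in this notebook.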

# Which subreddit has the highest average count of premium users?

In [12]:
subredditPremiumUser = df.filter(col("author_premium") == True)\
                            .groupBy("subreddit")\
                            .count()\
                            .sort(desc("count"))

In [13]:
type(subredditPremiumUser)

<class 'pyspark.sql.dataframe.DataFrame'>

In [14]:
subredditPremiumUser.take(5)

[Row(subreddit='AskReddit', count=4218891), Row(subreddit='politics', count=2792589), Row(subreddit='memes', count=1861198), Row(subreddit='wallstreetbets', count=1822647), Row(subreddit='dankmemes', count=1347485)]

In [21]:
subredditNormalUser = df.filter(col("author_premium") == False)\
                            .groupBy("subreddit")\
                            .count()\
                            .sort(desc("count"))

In [22]:
subredditNormalUser.take(5)

[Row(subreddit='AskReddit', count=70561047), Row(subreddit='memes', count=27572044), Row(subreddit='politics', count=25305408), Row(subreddit='teenagers', count=15348705), Row(subreddit='wallstreetbets', count=14124304)]
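
The raw counts above mostly track subreddit size, so a relative measure may be more informative. As a rough sketch, the snippet below takes the counts from the two `take(5)` outputs, keeps only the four subreddits that appear in both lists, and computes the share of comments written by premium authors. Note that these are comment counts rather than distinct users, and comments with a null `author_premium` are in neither count.

```python
# Comment counts copied from the two take(5) outputs above, restricted to the
# subreddits that appear in both lists.
premium_counts = {
    "AskReddit": 4218891,
    "politics": 2792589,
    "memes": 1861198,
    "wallstreetbets": 1822647,
}
non_premium_counts = {
    "AskReddit": 70561047,
    "politics": 25305408,
    "memes": 27572044,
    "wallstreetbets": 14124304,
}

# Share of each subreddit's comments written by premium authors.
premium_share = {
    name: premium_counts[name] / (premium_counts[name] + non_premium_counts[name])
    for name in premium_counts
}

# Highest premium share first.
ranked = sorted(premium_share.items(), key=lambda item: item[1], reverse=True)
for name, share in ranked:
    print(f"{name}: {share:.1%}")
```

Among these four, wallstreetbets has the highest premium share (about 11%) and AskReddit the lowest (about 6%), even though AskReddit has the most premium comments in absolute terms. A full answer would need the same ratio for every subreddit, not just these four.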

# Do premium users get more upvotes on their comments than non-premium users?

### Exploratory analysis

In [7]:
averageScoreOfPostsPremium = premiumAuthors.groupBy(premiumAuthors.author)\
                                .agg(avg(col("score")))
                                
# averageScoreOfPostsPremium = averageScoreOfPostsPremium.agg(avg(col("count")))
averageScoreOfPostsPremium.show()

+--------------------+------------------+
|              author|        avg(score)|
+--------------------+------------------+
|          Truth_Moab| 6.133333333333334|
|           Darius510| 2.834538152610442|
|nationalfilmandfashi|13.102316602316602|
|        adorationn01|               1.0|
|      Towering_Flesh|17.184647302904565|
|         LeDocteurNo| 3.582010582010582|
|    TuesdayJulyNever|18.933001107419713|
|           TheApiary|7.7406015037593985|
|           mewthulhu| 52.75672514619883|
|       coolest-llama| 4.390977443609023|
|            Lit3Bolt|32.717171717171716|
|            thexvoid| 7.263500325309043|
|       RekerOfScrubs|              8.85|
|              Dalmah|7.2926829268292686|
| SmallGirlBigTitties|2.7388663967611335|
|             vaskark| 9.680232558139535|
|           guimontag| 6.089206066012489|
|       primetimemime| 3.888888888888889|
|               Ohlav| 4.558139534883721|
|         coin_return| 6.849658314350798|
+--------------------+------------------+

In [6]:
# show() returns None, so the result is displayed rather than assigned
premiumAuthors.groupBy(premiumAuthors.author)\
    .agg(avg(col("score")))\
    .agg(sum(col("avg(score)")))\
    .show()

+-----------------+
|  sum(avg(score))|
+-----------------+
|6417827.648984599|
+-----------------+

In [7]:
# Mean of the per-author average scores; show() returns None, so do not assign it
premiumAuthors.groupBy(premiumAuthors.author)\
    .agg(avg(col("score")))\
    .agg(avg(col("avg(score)")))\
    .show()

+------------------+
|   avg(avg(score))|
+------------------+
|11.382356074390957|
+------------------+

In [8]:
# show() returns None, so the result is displayed rather than assigned
nonPremiumAuthors.groupBy(nonPremiumAuthors.author)\
    .agg(avg(col("score")))\
    .agg(sum(col("avg(score)")))\
    .show()

+--------------------+
|     sum(avg(score))|
+--------------------+
|1.3737728460159978E8|
+--------------------+

In [9]:
# Mean of the per-author average scores; show() returns None, so do not assign it
nonPremiumAuthors.groupBy(nonPremiumAuthors.author)\
    .agg(avg(col("score")))\
    .agg(avg(col("avg(score)")))\
    .show()

+-----------------+
|  avg(avg(score))|
+-----------------+
|4.678692483660346|
+-----------------+

# Output

As you can see from above, the average score (the mean of the per-author average scores) is 4.68 for a normal user and 11.38 for a premium user, about 2.4 times higher. That fits with the rest of the analysis we have done.
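
One caveat on these numbers: they are averages of per-author averages, so every author counts equally no matter how many comments they wrote. The toy example below (made-up scores, hypothetical author names) shows how that can differ from the average over all comments.

```python
from statistics import mean

# Hypothetical scores per author (made-up numbers).
toy_scores = {
    "frequent_author": [1, 1, 1, 1, 1, 1, 1, 1],  # 8 comments, all score 1
    "one_hit_author": [100],                       # 1 comment, score 100
}

# What the cells above compute: average each author first, then average those.
mean_of_author_means = mean(mean(scores) for scores in toy_scores.values())

# Alternative: average over every comment, ignoring who wrote it.
all_scores = [score for scores in toy_scores.values() for score in scores]
per_comment_mean = mean(all_scores)

print("mean of per-author means:", mean_of_author_means)
print("mean over all comments:  ", per_comment_mean)
```

Neither figure is wrong; they answer different questions. The per-author version describes a typical author, while the per-comment version describes a typical comment.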

This notebook is a bit messy; the next few cells are just me playing around.

In [5]:
averageScoreOfPostsPremium.take(5)

[Row(author='JayzTwoCents', avg(score)=21467.0), Row(author='Poldorovskij', avg(score)=17305.5), Row(author='FrontButtPunt', avg(score)=11770.0), Row(author='Borat', avg(score)=10955.130434782608), Row(author='Bowgoog71', avg(score)=10701.666666666666)]

In [8]:
# Average score per non-premium author (named for what it actually holds)
averageScoreOfPostsNormal = nonPremiumAuthors.groupBy(nonPremiumAuthors.author)\
                                .agg(avg(col("score")))

averageScoreOfPostsNormal.show()

+--------------------+------------------+
|              author|        avg(score)|
+--------------------+------------------+
|    hope_still_flies|2.1715210355987056|
|      CarmenFandango| 46.58226924480117|
|   GorillaGlueWookie|10.679297597042513|
|              Morwon|               7.6|
|            Ghosty66| 8.970967741935484|
|collegiatecollegeguy| 6.290094654046127|
|            Boko_Met|3.0483870967741935|
|           gncurrier|16.014102241249056|
|         strangefolk|11.850987432675044|
|    Thin-Course-4054| 5.326530612244898|
|           matrix556|1.5714285714285714|
|          hulapookie|  4.50531914893617|
|  ChampionOfKirkwall|  10.5739357729649|
|  a_rather_quiet_one| 7.170731707317073|
|         xXsquanchXx| 9.973033707865168|
|      SkollFenrirson|17.047824980920886|
|            librious|5.8891454965357966|
|          tinybull65|1.2239057239057238|
|            leahcars| 2.269230769230769|
|           Malavin81| 3.633093525179856|
+--------------------+------------------+

In [6]:
# Non-premium authors ranked by their average score, highest first
averageScoreOfPostsNormal = nonPremiumAuthors.groupBy(nonPremiumAuthors.author)\
                                .agg(avg(col("score")))\
                                .sort(desc(col("avg(score)")))

averageScoreOfPostsNormal.take(5)

[Row(author='confusedbarney', avg(score)=60843.0), Row(author='WilletteKinoshita', avg(score)=48916.0), Row(author='KynanArroyo', avg(score)=45922.0), Row(author='FernandeWorm', avg(score)=41584.0), Row(author='pasomider', avg(score)=37709.0)]