### Case Study: Mobile App Store
### Domain: Analytics

The ever-changing mobile landscape is a challenging space to navigate. 
Android holds about 53.2% of the smartphone market, while iOS holds 43%. 
To get more people to download your app, you need to make sure they 
can easily find it. Mobile app analytics is a great way to evaluate 
the existing strategy and to drive growth and retention of future 
users.

The data set contains more than 7000 Apple iOS mobile application details. 
The data was extracted from the iTunes Search API.

In [None]:
Tasks:
With millions of apps available today, data has become key to 
identifying the top trending apps in the iOS App Store. As a data 
scientist, you are required to explore the dataset, including cleaning 
and transforming it.

1. Load csv into spark as a text file
2. Parse the data as csv.
3. Convert bytes to MB and GB in a new column
4. List top 10 trending apps
5. Find the difference in the average number of screenshots displayed between the highest and lowest rated apps.
6. What percentage of high rated apps support multiple languages?
7. How do app details contribute to user ratings?
8. Compare the statistics of different app groups/genres.
9. Does length of app description contribute to the ratings?
10. Create a spark-submit application for the same and print the findings in the log



In [None]:
1. Load csv into spark as a text file
2. Parse the data as csv.

In [22]:
import findspark
findspark.init()

import pyspark 
from pyspark.sql import SparkSession

In [23]:
spark = SparkSession.builder.appName("Apple").getOrCreate()
df = spark.read.option("encoding", "windows-1252").csv('6ojmkg1aw1/AppleStore.csv',inferSchema=True,header=True)
df.printSchema()

#option("encoding", "windows-1252")
#option("encoding","UTF-8")

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- track_name: string (nullable = true)
 |-- size_bytes: long (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: double (nullable = true)
 |-- rating_count_tot: integer (nullable = true)
 |-- rating_count_ver: integer (nullable = true)
 |-- user_rating: double (nullable = true)
 |-- user_rating_ver: double (nullable = true)
 |-- ver: string (nullable = true)
 |-- cont_rating: string (nullable = true)
 |-- prime_genre: string (nullable = true)
 |-- sup_devices.num: integer (nullable = true)
 |-- ipadSc_urls.num: integer (nullable = true)
 |-- lang.num: integer (nullable = true)
 |-- vpp_lic: integer (nullable = true)
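
Tasks 1 and 2 ask for the file to be loaded as text and then parsed as CSV, while the cell above reads it directly with `spark.read.csv`. The sketch below is a Spark-free illustration of the parsing step only: it shows why each raw line should go through a real CSV parser rather than `split(',')`. The three-column sample is a hypothetical excerpt built from app names shown later in this notebook (the real file has 17 columns), and whether the real file quotes names that contain commas is an assumption.

```python
import csv

# Hypothetical three-column excerpt; the real AppleStore.csv has 17 columns.
SAMPLE_LINES = [
    'id,track_name,size_bytes',
    '281656475,PAC-MAN Premium,100788224',
    '281940292,"WeatherBug - Local Weather, Radar, Maps, Alerts",100524032',
]


def parse_csv_lines(lines):
    """Parse raw text lines as CSV, keeping quoted fields that contain commas intact."""
    reader = csv.reader(lines)
    header = next(reader)  # first line holds the column names
    return [dict(zip(header, row)) for row in reader]


rows = parse_csv_lines(SAMPLE_LINES)

# A naive split breaks the quoted app name into several pieces.
naive_fields = SAMPLE_LINES[2].split(',')

print(len(naive_fields))       # more than 3 fields: the name was split apart
print(rows[1]['track_name'])   # the full name, commas included
```

In Spark, the raw lines could come from `spark.sparkContext.textFile(path)` or `spark.read.text(path)` before a parsing step like this one is applied.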



In [24]:
name_list = ['_c0', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating','user_rating_ver','ver', 'cont_rating', 'prime_genre', 'sup_devices', 'ipadSc_urls', 'lang', 'vpp_lic']
df = df.toDF(*name_list)

In [None]:
3. Convert bytes to MB and GB in a new column

In [25]:
from pyspark.sql.functions import format_number

df2 = df.withColumn('size_GB',format_number(((df['size_bytes']/1024)/1024),2)) #Divides bytes by 1024 twice (bytes -> KB -> MB), so the values are in MB despite the column name size_GB

df2.select('id','track_name','size_bytes','size_GB').show(truncate=False)

+---------+--------------------------------------------------+----------+-------+
|id       |track_name                                        |size_bytes|size_GB|
+---------+--------------------------------------------------+----------+-------+
|281656475|PAC-MAN Premium                                   |100788224 |96.12  |
|281796108|Evernote - stay organized                         |158578688 |151.23 |
|281940292|WeatherBug - Local Weather, Radar, Maps, Alerts   |100524032 |95.87  |
|282614216|eBay: Best App to Buy, Sell, Save! Online Shopping|128512000 |122.56 |
|282935706|Bible                                             |92774400  |88.48  |
|283619399|Shanghai Mahjong                                  |10485713  |10.00  |
|283646709|PayPal - Send and request money safely            |227795968 |217.24 |
|284035177|Pandora - Music & Radio                           |130242560 |124.21 |
|284666222|PCalc - The Best Calculator                       |49250304  |46.97  |
|284736660|Ms. P
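
Task 3 asks for both MB and GB, but the cell above produces a single column whose values are megabytes. As a minimal sketch of the arithmetic, assuming binary units (1 MB = 1024² bytes, 1 GB = 1024³ bytes), the two conversions look like this. The helper names are hypothetical; the sample sizes are taken from the output above and from the Facebook row in the spark-submit log.

```python
BYTES_PER_MB = 1024 ** 2
BYTES_PER_GB = 1024 ** 3


def bytes_to_mb(size_bytes):
    """Convert a size in bytes to megabytes, rounded to 2 decimals."""
    return round(size_bytes / BYTES_PER_MB, 2)


def bytes_to_gb(size_bytes):
    """Convert a size in bytes to gigabytes, rounded to 2 decimals."""
    return round(size_bytes / BYTES_PER_GB, 2)


# size_bytes values copied from the notebook output
pacman_bytes = 100788224
facebook_bytes = 389879808

print(bytes_to_mb(pacman_bytes))    # 96.12, matching the table above
print(bytes_to_mb(facebook_bytes))  # 371.82
print(bytes_to_gb(facebook_bytes))  # 0.36
```

In the notebook, the same result could be obtained with two `withColumn` calls, one dividing `size_bytes` by `1024**2` and one by `1024**3`.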

In [None]:
4. List top 10 trending apps

In [26]:
from pyspark.sql.functions import countDistinct,count_distinct
df.select(countDistinct('id')).show()

+------------------+
|count(DISTINCT id)|
+------------------+
|              7197|
+------------------+



In [27]:
df.count()

7197

In [29]:
#Keep the full top-10 rows in top10df; later cells reuse it
top10df = df.sort('rating_count_tot',ascending=False).limit(10)
top10df.select('id','track_name','rating_count_tot').show()

+---------+--------------------+----------------+
|       id|          track_name|rating_count_tot|
+---------+--------------------+----------------+
|284882215|            Facebook|         2974676|
|389801252|           Instagram|         2161558|
|529479190|      Clash of Clans|         2130805|
|420009108|          Temple Run|         1724546|
|284035177|Pandora - Music &...|         1126879|
|429047995|           Pinterest|         1061624|
|282935706|               Bible|          985920|
|553834731|    Candy Crush Saga|          961794|
|324684580|       Spotify Music|          878563|
|343200656|         Angry Birds|          824451|
+---------+--------------------+----------------+



In [None]:
5. Find the difference in the average number of screenshots displayed between the highest and lowest rated apps.

In [47]:
#low10df = df.sort('rating_count_tot',ascending=True).limit(10)
#low10df.select('id','track_name','rating_count_tot').show(truncate=False,)
low10df = df.sort('rating_count_tot',ascending=True).limit(10).select('id','track_name','rating_count_tot')
low10df.show(truncate=False)

+---------+-----------------------------------+----------------+
|id       |track_name                         |rating_count_tot|
+---------+-----------------------------------+----------------+
|329174056|iLoan Calc (Loan calculator)       |0               |
|377321278|恵方コンパス.                      |0               |
|350480010|eBook: War and Peace               |0               |
|355709084|Jourist Weltübersetzer             |0               |
|404529222|ファッション通販 ZOZOTOWN          |0               |
|379256460|「宅建士」過去問題《受験用》       |0               |
|389543438|Der Feueralarm                     |0               |
|391947489|iSleeping by iSommeil SARL         |0               |
|394342281|Bowitter for iPhone                |0               |
|398166286|出会い系アプリ i-Mail（アイメール）|0               |
+---------+-----------------------------------+----------------+



In [31]:
from pyspark.sql.functions import mean
top10df.select(mean(top10df['rating_count_tot']).alias("Top 10 Apps rating averages")).show()

+---------------------------+
|Top 10 Apps rating averages|
+---------------------------+
|                  1483081.6|
+---------------------------+



In [32]:
low10df.select(mean(low10df['rating_count_tot']).alias("Lowest 10 Apps rating averages")).show()

+------------------------------+
|Lowest 10 Apps rating averages|
+------------------------------+
|                           0.0|
+------------------------------+
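
The cells above compare average rating counts, whereas task 5 asks about the average number of screenshots. A minimal sketch of that calculation follows, under two assumptions: that `ipadSc_urls` holds the number of screenshots shown for an app, and that "highest" and "lowest" rated mean the maximum and minimum `user_rating`. The rows are made-up toy values for illustration only and are not taken from AppleStore.csv.

```python
from statistics import mean

# Made-up toy rows for illustration only; NOT taken from AppleStore.csv.
TOY_APPS = [
    {"user_rating": 5.0, "ipadSc_urls": 5},
    {"user_rating": 5.0, "ipadSc_urls": 3},
    {"user_rating": 3.0, "ipadSc_urls": 4},
    {"user_rating": 1.0, "ipadSc_urls": 2},
    {"user_rating": 1.0, "ipadSc_urls": 0},
]


def screenshot_gap(apps):
    """Average screenshots of top-rated apps minus average of bottom-rated apps."""
    ratings = [app["user_rating"] for app in apps]
    highest, lowest = max(ratings), min(ratings)
    high_avg = mean(a["ipadSc_urls"] for a in apps if a["user_rating"] == highest)
    low_avg = mean(a["ipadSc_urls"] for a in apps if a["user_rating"] == lowest)
    return high_avg - low_avg


print(screenshot_gap(TOY_APPS))  # (5 + 3) / 2 - (2 + 0) / 2 = 3
```

On the real DataFrame, the same idea would be a filter on the maximum and minimum `user_rating` followed by `avg('ipadSc_urls')` for each group.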



In [None]:
6. What percentage of high rated apps support multiple languages?

In [33]:
from pyspark.sql.functions import max

max_lang = df.select(max(df['lang'])).collect()[0][0]

In [34]:
max_lang

75

In [35]:
top10df.select('id','track_name','lang',format_number(((top10df['lang']/max_lang)*100),2).alias('Language percentage')).show()

+---------+--------------------+----+-------------------+
|       id|          track_name|lang|Language percentage|
+---------+--------------------+----+-------------------+
|284882215|            Facebook|  29|              38.67|
|389801252|           Instagram|  29|              38.67|
|529479190|      Clash of Clans|  18|              24.00|
|420009108|          Temple Run|   1|               1.33|
|284035177|Pandora - Music &...|   1|               1.33|
|429047995|           Pinterest|  27|              36.00|
|282935706|               Bible|  45|              60.00|
|553834731|    Candy Crush Saga|  24|              32.00|
|324684580|       Spotify Music|  18|              24.00|
|343200656|         Angry Birds|  10|              13.33|
+---------+--------------------+----+-------------------+
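
The table above shows each app's language count as a share of the maximum (75), which is not quite what task 6 asks. A sketch of the percentage itself is given below, assuming that "high rated" means the ten apps listed above and that "multiple languages" means `lang > 1`; both readings are assumptions. The `lang` values are copied from the table.

```python
# (track_name, lang) pairs copied from the top-10 table above
TOP10_LANG = [
    ("Facebook", 29),
    ("Instagram", 29),
    ("Clash of Clans", 18),
    ("Temple Run", 1),
    ("Pandora - Music & Radio", 1),
    ("Pinterest", 27),
    ("Bible", 45),
    ("Candy Crush Saga", 24),
    ("Spotify Music", 18),
    ("Angry Birds", 10),
]


def pct_multilingual(apps):
    """Percentage of apps that support more than one language."""
    multilingual = sum(1 for _, lang in apps if lang > 1)
    return 100.0 * multilingual / len(apps)


print(pct_multilingual(TOP10_LANG))  # 80.0 (8 of the 10 apps have lang > 1)
```

A different definition of "high rated", for example a threshold on `user_rating`, would change the group and possibly the result.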



In [None]:
7. How do app details contribute to user ratings?

In [36]:
from pyspark.sql.functions import sum, avg

In [37]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- track_name: string (nullable = true)
 |-- size_bytes: long (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: double (nullable = true)
 |-- rating_count_tot: integer (nullable = true)
 |-- rating_count_ver: integer (nullable = true)
 |-- user_rating: double (nullable = true)
 |-- user_rating_ver: double (nullable = true)
 |-- ver: string (nullable = true)
 |-- cont_rating: string (nullable = true)
 |-- prime_genre: string (nullable = true)
 |-- sup_devices: integer (nullable = true)
 |-- ipadSc_urls: integer (nullable = true)
 |-- lang: integer (nullable = true)
 |-- vpp_lic: integer (nullable = true)



Looking at the features of the App Store data, the following features seem to contribute most to user ratings (rating_count_tot):

['size_bytes',  'currency', 'price', 'rating_count_ver', 'user_rating', 'ver',  'cont_rating', 'prime_genre', 'sup_devices', 'ipadSc_urls', 'lang', 'vpp_lic']

8. Compare the statistics of different app groups/genres.

In [None]:
Comparing the statistics below, apps that support more devices tend to have a higher average rating_count_tot, and among the genres shown, Medical attracts the least interest (lowest average rating_count_tot).

In [38]:
df.groupBy('sup_devices').agg(format_number(avg('cont_rating'),2).alias('cont_rating'),format_number(avg('sup_devices'),2).alias('sup_devices'),format_number(avg('rating_count_tot'),2).alias('rating_count_tot')).sort('sup_devices',ascending=False).show()

+-----------+-----------+-----------+----------------+
|sup_devices|cont_rating|sup_devices|rating_count_tot|
+-----------+-----------+-----------+----------------+
|         47|       null|      47.00|       15,259.42|
|         45|       null|      45.00|       12,996.62|
|         43|       null|      43.00|       13,014.39|
|         40|       null|      40.00|        7,534.39|
|         39|       null|      39.00|       23,635.60|
|         38|       null|      38.00|       13,131.64|
|         37|       null|      37.00|       15,430.83|
|         36|       null|      36.00|          902.71|
|         35|       null|      35.00|        1,040.96|
|         33|       null|      33.00|          855.00|
|         26|       null|      26.00|        1,658.64|
|         25|       null|      25.00|        1,469.55|
|         24|       null|      24.00|        7,186.11|
|         23|       null|      23.00|          412.00|
|         16|       null|      16.00|          507.62|
|         

In [39]:
df.groupBy('prime_genre').agg(format_number(avg('cont_rating'),2).alias('cont_rating'),format_number(avg('sup_devices'),2),format_number(avg('rating_count_tot'),2).alias('rating_count_tot'),).sort('prime_genre',ascending=False).show()

+-----------------+-----------+----------------------------------+----------------+
|      prime_genre|cont_rating|format_number(avg(sup_devices), 2)|rating_count_tot|
+-----------------+-----------+----------------------------------+----------------+
|          Weather|       null|                             36.65|       22,181.03|
|        Utilities|       null|                             36.69|        6,863.82|
|           Travel|       null|                             37.01|       14,129.44|
|           Sports|       null|                             36.92|       14,026.93|
|Social Networking|       null|                             36.52|       45,498.90|
|         Shopping|       null|                             36.62|       18,615.33|
|        Reference|       null|                             36.56|       22,410.84|
|     Productivity|       null|                             36.04|        8,051.33|
|    Photo & Video|       null|                             36.76|       14,
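
The `cont_rating` column is `null` in both tables because it is a string column in the schema, so `avg('cont_rating')` has nothing numeric to average. One possible fix is to turn each label into a number first. The sketch below assumes the labels look like `4+` or `17+`, which the notebook never shows, so treat the format as an assumption; the helper name is hypothetical.

```python
def parse_cont_rating(label):
    """Turn an age label such as '4+' into the integer 4; return None if it does not fit."""
    text = label.strip().rstrip('+')
    return int(text) if text.isdigit() else None


print(parse_cont_rating('4+'))    # 4
print(parse_cont_rating('17+'))   # 17
print(parse_cont_rating('n/a'))   # None
```

In PySpark, a comparable step could use `regexp_extract` on the column followed by a cast to integer before calling `avg`.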

In [None]:
9. Does length of app description contribute to the ratings?

In [40]:
from pyspark.sql.functions import length
df.groupBy((length(df['track_name'])).alias('name_length')).agg(sum(df['rating_count_tot']).alias('rating_count_tot')).show()

+-----------+----------------+
|name_length|rating_count_tot|
+-----------+----------------+
|         31|          871236|
|         85|          209406|
|         65|            4514|
|         53|           28594|
|        133|             150|
|         78|            9465|
|         34|          667318|
|        101|             178|
|         81|            2720|
|         28|          580861|
|         76|             222|
|         27|          815815|
|         26|          985314|
|         44|          414457|
|         12|         4767854|
|         91|            4368|
|         22|          957206|
|        232|             358|
|         93|            2906|
|         47|         1669522|
+-----------+----------------+
only showing top 20 rows



In [49]:
df.groupBy(length(df['track_name']).alias('name_length')).agg(sum('rating_count_tot')).sort('name_length',).show()

+-----------+---------------------+
|name_length|sum(rating_count_tot)|
+-----------+---------------------+
|          2|               168922|
|          3|               416445|
|          4|               514727|
|          5|              1461038|
|          6|              1661858|
|          7|              2957862|
|          8|              5087672|
|          9|              6586364|
|         10|              5886069|
|         11|              3508290|
|         12|              4767854|
|         13|              3922493|
|         14|              5217009|
|         15|              2772429|
|         16|              2551308|
|         17|              1854143|
|         18|              3918765|
|         19|              1885464|
|         20|              1449536|
|         21|              1681704|
+-----------+---------------------+
only showing top 20 rows



In [43]:
corrdf =df.groupBy(length(df['track_name']).alias('name_length')).agg(sum('rating_count_tot').alias('rating_count_tot'))

In [44]:
df.groupBy(length(df['track_name']).cast('string').alias('name_length'))\
  .agg(sum(df['rating_count_tot']).alias('rating_count_tot'))\
  .show()

+-----------+----------------+
|name_length|rating_count_tot|
+-----------+----------------+
|          7|         2957862|
|         51|            5885|
|         15|         2772429|
|         54|            6594|
|        232|             358|
|         11|         3508290|
|        101|             178|
|         29|          704668|
|         69|            2254|
|         42|          993978|
|         87|            9068|
|         73|            3748|
|         64|            6694|
|          3|          416445|
|         30|          379795|
|        113|            1255|
|         34|          667318|
|        133|             150|
|         59|           12198|
|          8|         5087672|
+-----------+----------------+
only showing top 20 rows



In [45]:
corrdf.corr('name_length','rating_count_tot')

-0.573500480238124

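Yes. The correlation of about -0.57 suggests a moderate negative relationship: longer names go with lower total rating counts. The dataset has no description column, so the length of track_name stands in for description length here.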
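
For reference, the figure returned by `corr` is a Pearson correlation coefficient. The sketch below computes it by hand on the 20 rows of the sorted output above (name lengths 2 to 21). Because this is only a subset of `corrdf`, the result is not expected to match -0.57; the point is to show what the number measures.

```python
import math

# (name_length, summed rating_count_tot) rows copied from the sorted output above
ROWS = [
    (2, 168922), (3, 416445), (4, 514727), (5, 1461038), (6, 1661858),
    (7, 2957862), (8, 5087672), (9, 6586364), (10, 5886069), (11, 3508290),
    (12, 4767854), (13, 3922493), (14, 5217009), (15, 2772429), (16, 2551308),
    (17, 1854143), (18, 3918765), (19, 1885464), (20, 1449536), (21, 1681704),
]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


lengths = [length for length, _ in ROWS]
totals = [total for _, total in ROWS]
r_subset = pearson(lengths, totals)
print(r_subset)  # correlation on these 20 rows only
```

Note that the grouped totals also depend on how many apps share each name length, so a per-app correlation or an average per length might tell a different story.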

In [None]:
10. Create a spark-submit application for the same and print the findings in the log

In [None]:
23/02/24 14:33:16 INFO SparkContext: Running Spark version 3.3.2
23/02/24 14:33:16 INFO ResourceUtils: ==============================================================
23/02/24 14:33:16 INFO ResourceUtils: No custom resources configured for spark.driver.
23/02/24 14:33:16 INFO ResourceUtils: ==============================================================
23/02/24 14:33:16 INFO SparkContext: Submitted application: Apple
23/02/24 14:33:16 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/02/24 14:33:16 INFO ResourceProfile: Limiting resource is cpu
23/02/24 14:33:16 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/02/24 14:33:16 INFO SecurityManager: Changing view acls to: suzuk
23/02/24 14:33:16 INFO SecurityManager: Changing modify acls to: suzuk
23/02/24 14:33:16 INFO SecurityManager: Changing view acls groups to:
23/02/24 14:33:16 INFO SecurityManager: Changing modify acls groups to:
23/02/24 14:33:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(suzuk); groups with view permissions: Set(); users  with modify permissions: Set(suzuk); groups with modify permissions: Set()
23/02/24 14:33:16 INFO Utils: Successfully started service 'sparkDriver' on port 59286.
23/02/24 14:33:16 INFO SparkEnv: Registering MapOutputTracker
23/02/24 14:33:16 INFO SparkEnv: Registering BlockManagerMaster
23/02/24 14:33:16 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/02/24 14:33:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/02/24 14:33:16 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/02/24 14:33:16 INFO DiskBlockManager: Created local directory at C:\Users\suzuk\AppData\Local\Temp\blockmgr-b7670a56-7981-4f30-abbf-41d3fcbf774d
23/02/24 14:33:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
23/02/24 14:33:17 INFO SparkEnv: Registering OutputCommitCoordinator
23/02/24 14:33:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/02/24 14:33:17 INFO Utils: Successfully started service 'SparkUI' on port 4041.
23/02/24 14:33:17 INFO Executor: Starting executor ID driver on host LAPTOP-TCJS4952
23/02/24 14:33:17 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
23/02/24 14:33:17 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59338.
23/02/24 14:33:17 INFO NettyBlockTransferService: Server created on LAPTOP-TCJS4952:59338
23/02/24 14:33:17 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/02/24 14:33:17 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, LAPTOP-TCJS4952, 59338, None)
23/02/24 14:33:17 INFO BlockManagerMasterEndpoint: Registering block manager LAPTOP-TCJS4952:59338 with 366.3 MiB RAM, BlockManagerId(driver, LAPTOP-TCJS4952, 59338, None)
23/02/24 14:33:17 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, LAPTOP-TCJS4952, 59338, None)
23/02/24 14:33:17 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, LAPTOP-TCJS4952, 59338, None)
root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- track_name: string (nullable = true)
 |-- size_bytes: long (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: double (nullable = true)
 |-- rating_count_tot: integer (nullable = true)
 |-- rating_count_ver: integer (nullable = true)
 |-- user_rating: double (nullable = true)
 |-- user_rating_ver: double (nullable = true)
 |-- ver: string (nullable = true)
 |-- cont_rating: string (nullable = true)
 |-- prime_genre: string (nullable = true)
 |-- sup_devices.num: integer (nullable = true)
 |-- ipadSc_urls.num: integer (nullable = true)
 |-- lang.num: integer (nullable = true)
 |-- vpp_lic: integer (nullable = true)

+---------+--------------------------------------------------+----------+-------+
|id       |track_name                                        |size_bytes|size_GB|
+---------+--------------------------------------------------+----------+-------+
|281656475|PAC-MAN Premium                                   |100788224 |96.12  |
|281796108|Evernote - stay organized                         |158578688 |151.23 |
|281940292|WeatherBug - Local Weather, Radar, Maps, Alerts   |100524032 |95.87  |
|282614216|eBay: Best App to Buy, Sell, Save! Online Shopping|128512000 |122.56 |
|282935706|Bible                                             |92774400  |88.48  |
|283619399|Shanghai Mahjong                                  |10485713  |10.00  |
|283646709|PayPal - Send and request money safely            |227795968 |217.24 |
|284035177|Pandora - Music & Radio                           |130242560 |124.21 |
|284666222|PCalc - The Best Calculator                       |49250304  |46.97  |
|284736660|Ms. PAC-MAN                                       |70023168  |66.78  |
|284791396|Solitaire by MobilityWare                         |49618944  |47.32  |
|284815117|SCRABBLE Premium                                  |227547136 |217.01 |
|284815942|Google û Search made just for mobile              |179979264 |171.64 |
|284847138|Bank of America - Mobile Banking                  |160925696 |153.47 |
|284862767|FreeCell                                          |55153664  |52.60  |
|284876795|TripAdvisor Hotels Flights Restaurants            |207907840 |198.28 |
|284882215|Facebook                                          |389879808 |371.82 |
|284910350|Yelp - Nearby Restaurants, Shopping & Services    |167407616 |159.65 |
|284993459|Shazam - Discover music, artists, videos & lyrics |147093504 |140.28 |
|285005463|Crash Bandicoot Nitro Kart 3D                     |10735026  |10.24  |
+---------+--------------------------------------------------+----------+-------+
only showing top 20 rows

+------------------+
|count(DISTINCT id)|
+------------------+
|              7197|
+------------------+

+---------+-----------------------+----------------+
|id       |track_name             |rating_count_tot|
+---------+-----------------------+----------------+
|284882215|Facebook               |2974676         |
|389801252|Instagram              |2161558         |
|529479190|Clash of Clans         |2130805         |
|420009108|Temple Run             |1724546         |
|284035177|Pandora - Music & Radio|1126879         |
|429047995|Pinterest              |1061624         |
|282935706|Bible                  |985920          |
|553834731|Candy Crush Saga       |961794          |
|324684580|Spotify Music          |878563          |
|343200656|Angry Birds            |824451          |
+---------+-----------------------+----------------+

+---------------------------+
|Top 10 Apps rating averages|
+---------------------------+
|                  1483081.6|
+---------------------------+

+------------------------------+
|Lowest 10 Apps rating averages|
+------------------------------+
|                           0.0|
+------------------------------+

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- track_name: string (nullable = true)
 |-- size_bytes: long (nullable = true)
 |-- currency: string (nullable = true)
 |-- price: double (nullable = true)
 |-- rating_count_tot: integer (nullable = true)
 |-- rating_count_ver: integer (nullable = true)
 |-- user_rating: double (nullable = true)
 |-- user_rating_ver: double (nullable = true)
 |-- ver: string (nullable = true)
 |-- cont_rating: string (nullable = true)
 |-- prime_genre: string (nullable = true)
 |-- sup_devices: integer (nullable = true)
 |-- ipadSc_urls: integer (nullable = true)
 |-- lang: integer (nullable = true)
 |-- vpp_lic: integer (nullable = true)

By looking atthe features of the Appstore data. Following features seems more contributed to user ratings rating_count_tot

['size_bytes', 'currency', 'price', 'rating_count_ver', 'user_rating', 'ver', 'cont_rating', 'prime_genre', 'sup_devices', 'ipadSc_urls', 'lang', 'vpp_lic']
By comparing the below statistics  we  found number of support devices have helped out rating_tot and By genre we  found the Medical is not most interested.
+-----------+-----------+-----------+----------------+
|sup_devices|cont_rating|sup_devices|rating_count_tot|
+-----------+-----------+-----------+----------------+
|         47|       null|      47.00|       15,259.42|
|         45|       null|      45.00|       12,996.62|
|         43|       null|      43.00|       13,014.39|
|         40|       null|      40.00|        7,534.39|
|         39|       null|      39.00|       23,635.60|
|         38|       null|      38.00|       13,131.64|
|         37|       null|      37.00|       15,430.83|
|         36|       null|      36.00|          902.71|
|         35|       null|      35.00|        1,040.96|
|         33|       null|      33.00|          855.00|
|         26|       null|      26.00|        1,658.64|
|         25|       null|      25.00|        1,469.55|
|         24|       null|      24.00|        7,186.11|
|         23|       null|      23.00|          412.00|
|         16|       null|      16.00|          507.62|
|         15|       null|      15.00|          758.00|
|         13|       null|      13.00|          458.57|
|         12|       null|      12.00|      287,589.00|
|         11|       null|      11.00|        4,361.67|
|          9|       null|       9.00|        1,760.00|
+-----------+-----------+-----------+----------------+

+-----------------+-----------+----------------------------------+----------------+
|      prime_genre|cont_rating|format_number(avg(sup_devices), 2)|rating_count_tot|
+-----------------+-----------+----------------------------------+----------------+
|          Weather|       null|                             36.65|       22,181.03|
|        Utilities|       null|                             36.69|        6,863.82|
|           Travel|       null|                             37.01|       14,129.44|
|           Sports|       null|                             36.92|       14,026.93|
|Social Networking|       null|                             36.52|       45,498.90|
|         Shopping|       null|                             36.62|       18,615.33|
|        Reference|       null|                             36.56|       22,410.84|
|     Productivity|       null|                             36.04|        8,051.33|
|    Photo & Video|       null|                             36.76|       14,352.28|
|             News|       null|                             36.56|       13,015.07|
|       Navigation|       null|                             36.24|       11,853.96|
|            Music|       null|                             35.47|       28,842.02|
|          Medical|       null|                             36.65|          592.78|
|        Lifestyle|       null|                             37.07|        6,161.76|
| Health & Fitness|       null|                             35.89|        9,913.17|
|            Games|       null|                             38.02|       13,692.00|
|     Food & Drink|       null|                             36.92|       13,938.62|
|          Finance|       null|                             36.84|       11,047.65|
|    Entertainment|       null|                             36.67|        7,533.68|
|        Education|       null|                             36.68|        2,239.23|
+-----------------+-----------+----------------------------------+----------------+
only showing top 20 rows

+-----------+---------------------+
|name_length|sum(rating_count_tot)|
+-----------+---------------------+
|          7|              2957862|
|         51|                 5885|
|         15|              2772429|
|         54|                 6594|
|        232|                  358|
|         11|              3508290|
|        101|                  178|
|         29|               704668|
|         69|                 2254|
|         42|               993978|
|         87|                 9068|
|         73|                 3748|
|         64|                 6694|
|          3|               416445|
|         30|               379795|
|        113|                 1255|
|         34|               667318|
|        133|                  150|
|         59|                12198|
|          8|              5087672|
+-----------+---------------------+
only showing top 20 rows

Yes. Length of the application description contribute to the ratings.

In [None]:
D:\BigDataLocalSetup\spark-3.2.3-bin-hadoop3.2\bin\spark-submit.cmd --master local AppleNameLengthContribute.py