<a href="https://colab.research.google.com/github/sai-prat/DataAnalysis-Pyspark/blob/main/The_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome to the Notebook**

### Let's mount Google Drive

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Task 1 :
Installing the pyspark module

In [9]:
!pip install pyspark



Importing the modules

In [10]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc, col, max
import matplotlib.pyplot as plts

Creating the Spark session

In [11]:
spark=SparkSession.builder.appName("spark_app").getOrCreate()

# Task 2 :
Importing the *listenings.csv* file:

In [12]:
Listening_file_path='/content/drive/MyDrive/dataset/dataset/listenings.csv'
df_listening=spark.read.format("csv")\
             .option("InferSchema","True")\
             .option("header","True")\
             .load(Listening_file_path)

let's check the data:

In [13]:
df_listening.show()

+-----------+-------------+--------------------+---------------+--------------------+
|    user_id|         date|               track|         artist|               album|
+-----------+-------------+--------------------+---------------+--------------------+
|000Silenced|1299680100000|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|1299679920000|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|1299679440000|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|1299679200000|            Acapella|          Kelis|            Acapella|
|000Silenced|1299675660000|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|1297511400000|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|1294498440000|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|1292438340000|               ObZen|      Meshuggah|               ObZen|
|000Silenced|1292437740000|   Yama's Messengers|      

printSchema(): a method of the DataFrame object that prints the **schema** of the DataFrame, that is, each column name and its data type, as a tree.

        df.printSchema()

schema: an attribute of the DataFrame object that returns the schema as a StructType object, so it can be used programmatically.

        df.schema

columns: an attribute of the DataFrame object that returns the column names as a Python list.

        df.columns

show(): a method of the DataFrame object that displays the **data** of the DataFrame. By default it displays only the first 20 rows. If you pass a number n, it displays at most n rows.

        df.show()




Difference between a pandas DataFrame and a PySpark DataFrame

1) ***Pandas DataFrames*** are part of the Python ecosystem. They are mutable (the data and structure can be changed after creation), suited only to small and medium-sized data because the whole DataFrame must fit into the memory of a single machine, and offer no fault tolerance (the data is lost if that machine fails).

2) ***PySpark DataFrames*** are part of Apache Spark and are designed for large-scale data distributed across a cluster of machines. They offer fault tolerance and are immutable (a DataFrame cannot be changed once created; every transformation returns a new DataFrame).
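To make the mutability point concrete, here is a small pandas-only sketch. The column names and values are hypothetical. Adding a column changes the existing object instead of creating a new one.

```python
import pandas as pd

# Toy data with hypothetical column names.
pdf = pd.DataFrame({"track": ["Price Tag", "ObZen"], "plays": [3, 1]})
id_before = id(pdf)

pdf["plays_doubled"] = pdf["plays"] * 2  # changes the existing object in place

same_object = id(pdf) == id_before       # no new DataFrame was created
print(same_object, list(pdf.columns))
```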

Let's print the object id of df_listening

In [24]:
id(df_listening)

132745964056544

In PySpark, two DataFrames with the same content can still be different DataFrame objects in memory, because DataFrames are immutable and every transformation returns a new object.

let's print the schema of the dataframe

In [14]:
df_listening.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- date: long (nullable = true)
 |-- track: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- album: string (nullable = true)



let's print the schema of the dataframe in another way, using the schema attribute

In [15]:
df_listening.schema

StructType([StructField('user_id', StringType(), True), StructField('date', LongType(), True), StructField('track', StringType(), True), StructField('artist', StringType(), True), StructField('album', StringType(), True)])

Display the list of columns in a dataframe

In [16]:
df_listening.columns

['user_id', 'date', 'track', 'artist', 'album']

***let's delete useless columns:***

In [20]:
df_listening=df_listening.drop('date')

Now let's check the object id of df_listening

In [22]:
id(df_listening)

132745964056544

old df_listening object id=132745966436256

new df_listening object id=132745964056544

You might wonder: df_listening is a DataFrame and therefore immutable, so how can the statement below work?

df_listening=df_listening.drop('date')

Here df_listening is a variable, not the DataFrame itself. df_listening.drop('date') returns a new DataFrame, which has a new object id, and that new object is assigned to the df_listening variable. Therefore df_listening now refers to the new DataFrame, while the original DataFrame is unchanged.

Now show what is inside df_listening; the date column has been removed.

In [23]:
df_listening.show()

+-----------+--------------------+---------------+--------------------+
|    user_id|               track|         artist|               album|
+-----------+--------------------+---------------+--------------------+
|000Silenced|           Price Tag|       Jessie J|         Who You Are|
|000Silenced|Price Tag (Acoust...|       Jessie J|           Price Tag|
|000Silenced|Be Mine! (Ballad ...|          Robyn|            Be Mine!|
|000Silenced|            Acapella|          Kelis|            Acapella|
|000Silenced|   I'm Not Invisible|      The Tease|   I'm Not Invisible|
|000Silenced|Bounce (Feat NORE...|       MSTRKRFT|         Fist of God|
|000Silenced|Don't Stop The Mu...|        Rihanna|Addicted 2 Bassli...|
|000Silenced|               ObZen|      Meshuggah|               ObZen|
|000Silenced|   Yama's Messengers|         Gojira|The Way of All Flesh|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For No...|
|000Silenced|On the Brink of E...|   Napalm Death|Time Waits For

Drop NULL records in a dataframe
df.na.drop(how='a

***drop the null rows:***

In [None]:
df_listening=df_listening.na.dr

df

let's check the dataset again:

let's see the schema:

let's see the shape of our dataframe:

# Task 3:

**Query #0:**
select two columns: track and artist

**Query #1**:

Let's find all of the records of those users who have listened to ***Rihanna***

**Query #2:**

Let's find the top 10 users who are fans of ***Rihanna***

**Query #3:**

find the top 10 most famous tracks

**Query #4:**

find top 10 famous tracks of ***Rihanna***

**Query #5:**

find top 10 famous albums

# Task 4 :
importing the ***genre.csv*** file:

let's check the data

Let's inner join these two data frames

**Query #6**

find the top 10 users who are fans of ***pop*** music

**Query #7**

find top 10 famous genres

# Task 5:
**Query #8**

find out each user's favourite genre

**Query #9**

find out how many pop, rock, metal and hip hop singers we have

and then visualize it using a bar chart

Now, let's visualize the results using ***matplotlib***

now let's visualize these two lists using a bar chart
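A minimal matplotlib sketch of such a bar chart. The two lists below hold made-up numbers, not results from the dataset, and the labels are only suggestions.

```python
import matplotlib.pyplot as plt

# Hypothetical values standing in for the two lists from the query
genres = ["pop", "rock", "metal", "hip hop"]
counts = [120, 95, 60, 40]

fig, ax = plt.subplots()
bars = ax.bar(genres, counts)  # one bar per genre
ax.set_xlabel("genre")
ax.set_ylabel("number of artists")
ax.set_title("Artists per genre")
plt.show()
```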