
# Indian by Heart: Revenue Analysis

This dataset captures detailed information about a collection of videos published by the "Indian by Heart" channel on a popular streaming platform. We will analyse it with SQL queries, run against a SQLite database through pandas. The following data dictionary describes each column:

| Column Name        | Description                                         | Example   |
|--------------------|-----------------------------------------------------|-----------|
| id                 | Unique identifier for each video                    | 911       |
| title              | Title of the video                                  | Life of a Bollywood Stuntman |
| publish_day        | Day of the month when the video was published       | 24        |
| publish_month      | Month when the video was published                  | 4         |
| publish_year       | Year when the video was published                   | 2021      |
| video_type         | Type of the video (e.g., Shorts, Long)              | Shorts    |
| duration_seconds   | Duration of the video in seconds                    | 861       |
| view               | Number of views the video received (in thousands)   | 847.173   |
| like               | Number of likes the video received (in thousands)   | 28.765    |
| comment            | Number of comments on the video (in thousands)      | 1.289     |
| impression         | Number of impressions the video received (in thousands) | 1541.855 |
| revenue            | Revenue generated by the video (in thousands)       | 69.383    |


In [None]:
import pandas as pd
import sqlite3
import requests

# URL of the CSV file on GitHub
url = 'https://raw.githubusercontent.com/Invact-Abhay/SQL/main/SQLC2.csv'

# Download the CSV file (fail early if the request did not succeed)
response = requests.get(url)
response.raise_for_status()
with open('ytdata.csv', 'wb') as file:
    file.write(response.content)

# Load the CSV file into a pandas DataFrame
data = pd.read_csv('ytdata.csv')

# Create a SQLite database (or connect to an existing one)
conn = sqlite3.connect('ytdata.db')

# Load the DataFrame into the SQLite database
data.to_sql('ytdata', conn, if_exists='replace', index=False)

246

**Task 1**

Retrieve the ytdata data using SQL query.

In [None]:
pd.read_sql_query("Select * from ytdata",conn)

Unnamed: 0,id,title,publish_day,publish_month,publish_year,video_type,duration_seconds,view,like,comment,impression,revenue
0,911,Life of a Bollywood Stuntman,24,4,2021,Shorts,861,847.173,28.765,1.289,1541.855,69.383
1,912,Ultimate Crorepati Hide And Seek Challenge,18,12,2021,Long,729,320.902,21.252,0.736,1399.132,62.961
2,913,Opening a Giant Mystery Box Worth 50 Lakhs,3,4,2021,Long,709,1017.456,31.108,1.628,671.521,30.218
3,914,Bank Heist Challenge: Winner Takes 10 Lakhs,26,9,2021,Long,482,500.089,23.596,1.206,1600.286,72.013
4,915,Press for a Chance to Win 10 Lakhs!,14,11,2020,Shorts,911,1455.270,27.624,0.839,6941.640,312.374
...,...,...,...,...,...,...,...,...,...,...,...,...
241,1152,Remember When COD Was Fun?,26,4,2015,Shorts,216,0.163,0.006,0.001,0.462,0.021
242,1153,Insane Gun Sync - 7 Hours To Make,21,6,2015,Long,134,0.157,0.006,0.001,0.279,0.013
243,1154,MY MESSAGE TO COD YOUTUBERS (Watch till end plz),29,5,2015,Shorts,292,0.185,0.005,0.001,0.712,0.032
244,1155,L0114R - Biblical Creeper Post for Post @L0114R,15,5,2015,Long,109,0.166,0.004,0.001,0.773,0.035


# Aggregate Functions

**Task 2**

Retrieve the unique publish year for ytdata using SQL query.

In [None]:
#pd.read_sql_query("Select publish_year from ytdata",conn)
pd.read_sql_query("Select distinct(publish_year) from ytdata",conn)

Unnamed: 0,publish_year
0,2021
1,2020
2,2019
3,2018
4,2017
5,2015
6,2016
7,2022
8,2023


**Task 3**

Find the total revenue earned.


In [None]:
pd.read_sql_query("Select sum(revenue) from ytdata",conn)

Unnamed: 0,sum(revenue)
0,15746.42


**Task 4**

Find the title of the video in the ytdata dataset with the maximum duration in seconds.

In [None]:
pd.read_sql_query("Select title,  max(duration_seconds) from ytdata",conn)

Unnamed: 0,title,max(duration_seconds)
0,"I Counted To 100,000!",85686
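
The query above relies on SQLite-specific behaviour: when a query contains a single `max()` or `min()` aggregate, a bare column such as `title` is taken from the row that holds the extreme value. Other engines, such as PostgreSQL, reject this form. A minimal sketch of two more widely supported alternatives follows, using a hypothetical three-row in-memory table (`toy_conn` and its rows are invented for illustration, not taken from the dataset):

```python
import sqlite3

# Hypothetical toy rows (not the real dataset), just to illustrate the pattern.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE ytdata (title TEXT, duration_seconds INTEGER)")
toy_conn.executemany(
    "INSERT INTO ytdata VALUES (?, ?)",
    [("Video A", 120), ("Video B", 900), ("Video C", 45)],
)

# Sort by duration and keep only the longest video.
longest = toy_conn.execute(
    """SELECT title, duration_seconds
       FROM ytdata
       ORDER BY duration_seconds DESC
       LIMIT 1"""
).fetchone()

# Subquery form; returns every row tied for the maximum.
longest_all = toy_conn.execute(
    """SELECT title, duration_seconds
       FROM ytdata
       WHERE duration_seconds = (SELECT MAX(duration_seconds) FROM ytdata)"""
).fetchall()

print(longest)      # ('Video B', 900)
print(longest_all)  # [('Video B', 900)]
```

The same query strings can be passed to `pd.read_sql_query` with the notebook's own `conn`. Swapping `DESC` for `ASC` (or `MAX` for `MIN`) gives the shortest video instead.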


**Task 5**

Find the title of the video in the ytdata dataset with the maximum duration in minutes. Name the maximum duration header MaxDurationMin.

In [None]:
#pd.read_sql_query("Select title,  max(duration_seconds)/60 from ytdata",conn)
pd.read_sql_query("Select title,  max(duration_seconds)/60 as MaxDurationMin from ytdata",conn)

Unnamed: 0,title,MaxDurationMin
0,"I Counted To 100,000!",1428
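
The result is a whole number because SQLite truncates when both operands of `/` are integers, which appears to be the case here (85686 seconds divided by 60). A short sketch of the difference, assuming the duration is stored as an integer; dividing by `60.0` instead keeps the fractional part:

```python
import sqlite3

toy_conn = sqlite3.connect(":memory:")

# Integer / integer truncates in SQLite; a REAL operand keeps the fraction.
whole_minutes, exact_minutes = toy_conn.execute(
    "SELECT 85686 / 60, 85686 / 60.0"
).fetchone()

print(whole_minutes)            # 1428
print(round(exact_minutes, 2))  # 1428.1
```

Whether truncated or fractional minutes are wanted depends on the task; the sketch only shows how to get either.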


**Task 6**

Find the title of the video in the ytdata dataset with the minimum duration in seconds.



In [None]:
pd.read_sql_query("Select title,  min(duration_seconds) from ytdata",conn)

Unnamed: 0,title,min(duration_seconds)
0,Indian Street Market Shopping Challenge,52


**Task 7**

Find the total number of videos uploaded using the count function. Name the header VideoCount.

In [None]:
pd.read_sql_query("Select count(id) as VideoCount from ytdata",conn)

Unnamed: 0,VideoCount
0,246


**Task 8**


Find the total number of unique titles to check that no duplicates exist. Name the header DistinctVideoCount.


In [None]:
pd.read_sql_query("Select count(distinct(title)) as DistinctVideoCount from ytdata",conn)

Unnamed: 0,DistinctVideoCount
0,246
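
Both counts are 246, so every title in the table is unique. Had the two numbers differed, a grouped query could list the repeated titles. A minimal sketch on hypothetical toy rows, where one title is deliberately repeated (the table contents are invented for illustration):

```python
import sqlite3

# Hypothetical toy rows with one deliberately repeated title.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE ytdata (id INTEGER, title TEXT)")
toy_conn.executemany(
    "INSERT INTO ytdata VALUES (?, ?)",
    [(1, "Video A"), (2, "Video B"), (3, "Video A")],
)

# Compare the total row count with the distinct-title count.
total, distinct_titles = toy_conn.execute(
    "SELECT COUNT(id), COUNT(DISTINCT title) FROM ytdata"
).fetchone()

# List the titles that occur more than once.
duplicates = toy_conn.execute(
    """SELECT title, COUNT(*) AS occurrences
       FROM ytdata
       GROUP BY title
       HAVING COUNT(*) > 1"""
).fetchall()

print(total, distinct_titles)  # 3 2
print(duplicates)              # [('Video A', 2)]
```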


**Task 9**

Find the average video duration in minutes.

In [None]:
pd.read_sql_query("Select avg(duration_seconds)/60 from ytdata",conn)

Unnamed: 0,avg(duration_seconds)/60
0,31.565244


**Task 10**

Find the average video duration in minutes. Name the header as AvgDurationMin

In [None]:
pd.read_sql_query("Select avg(duration_seconds)/60 as AvgDurationMin from ytdata",conn)

Unnamed: 0,AvgDurationMin
0,31.565244


# GROUP BY ( One column & Two column )


**Task 11**

Using an SQL query, retrieve the total revenue for each year from the ytdata table and name the revenue column revenue_earned.

In [None]:
#pd.read_sql_query("Select publish_year, sum(revenue) as revenue_earned from ytdata",conn)
pd.read_sql_query("Select publish_year, sum(revenue) as revenue_earned from ytdata group by publish_year",conn)

Unnamed: 0,publish_year,revenue_earned
0,2015,43.828
1,2016,46.558
2,2017,286.009
3,2018,2839.77
4,2019,4530.957
5,2020,4941.056
6,2021,3029.756
7,2022,2.318
8,2023,26.168


**Task 12**

Using an SQL query, retrieve the total revenue for each year and each video type from the ytdata table.

Name the revenue column revenue_earned, and order the results by publish_year and video_type, both in ascending order.

In [None]:
pd.read_sql_query("""Select publish_year, video_type, sum(revenue) as revenue_earned
                      from ytdata
                      group by publish_year, video_type
                      order by publish_year asc, video_type asc
                      """,conn)

Unnamed: 0,publish_year,video_type,revenue_earned
0,2015,Long,19.313
1,2015,Shorts,24.515
2,2016,Long,18.033
3,2016,Shorts,28.525
4,2017,Long,55.451
5,2017,Shorts,230.558
6,2018,Long,1547.86
7,2018,Shorts,1291.91
8,2019,Long,2125.094
9,2019,Shorts,2405.863
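
Tasks 11 and 12 differ only in the `GROUP BY` list: one column yields one row per year, while two columns yield one row per (year, video type) pair. A small sketch of that difference on hypothetical toy rows (the values below are invented for illustration, not taken from the dataset):

```python
import sqlite3

# Hypothetical toy rows: three videos in 2020 and one in 2021.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE ytdata (publish_year INTEGER, video_type TEXT, revenue REAL)"
)
toy_conn.executemany(
    "INSERT INTO ytdata VALUES (?, ?, ?)",
    [
        (2020, "Long", 10.0),
        (2020, "Shorts", 5.0),
        (2020, "Shorts", 2.5),
        (2021, "Long", 4.0),
    ],
)

# One grouping column: one row per year.
by_year = toy_conn.execute(
    """SELECT publish_year, SUM(revenue) AS revenue_earned
       FROM ytdata
       GROUP BY publish_year
       ORDER BY publish_year"""
).fetchall()

# Two grouping columns: one row per (year, video type) combination.
by_year_and_type = toy_conn.execute(
    """SELECT publish_year, video_type, SUM(revenue) AS revenue_earned
       FROM ytdata
       GROUP BY publish_year, video_type
       ORDER BY publish_year, video_type"""
).fetchall()

print(by_year)           # [(2020, 17.5), (2021, 4.0)]
print(by_year_and_type)  # [(2020, 'Long', 10.0), (2020, 'Shorts', 7.5), (2021, 'Long', 4.0)]
```

The per-type rows for a year always add up to that year's single-column total, which is a quick sanity check on grouped results.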


**Task 13**

Using an SQL query, retrieve the number of videos published in each year from the ytdata table, and name that column video_published.

In [None]:
pd.read_sql_query("""Select publish_year, count(title) as video_published
                   from ytdata
                   group by publish_year""",conn)

Unnamed: 0,publish_year,video_published
0,2015,25
1,2016,8
2,2017,12
3,2018,56
4,2019,52
5,2020,38
6,2021,32
7,2022,11
8,2023,12


**Task 14**

Using an SQL query, retrieve the total revenue for each month of the year 2021 from the ytdata table.

Columns to include: month, year, and revenue.

Name the revenue column revenue_earned.

In [None]:
pd.read_sql_query("""Select publish_month, publish_year, sum(revenue) as revenue_earned
                   from ytdata
                   where publish_year = 2021
                   group by publish_month""",conn)

Unnamed: 0,publish_month,publish_year,revenue_earned
0,1,2021,144.576
1,2,2021,68.49
2,3,2021,543.1
3,4,2021,545.842
4,6,2021,193.339
5,7,2021,227.602
6,8,2021,259.751
7,9,2021,165.38
8,10,2021,164.127
9,11,2021,550.234


**Task 15**

Using an SQL query, retrieve the total views for each month of the year 2020 for videos of type 'Shorts' from the ytdata table.

Columns to include: month, video type, year, and views.

Name the views column total_views.

Filters to apply: year and video type.

In [None]:
pd.read_sql_query("""Select publish_month, video_type, publish_year, sum(view) as total_views
                   from ytdata
                   where publish_year = 2020 and video_type = 'Shorts'
                   group by publish_month""",conn)

Unnamed: 0,publish_month,video_type,publish_year,total_views
0,1,Shorts,2020,1467.658
1,2,Shorts,2020,3296.684
2,3,Shorts,2020,637.567
3,4,Shorts,2020,1686.21
4,5,Shorts,2020,347.71
5,6,Shorts,2020,813.914
6,8,Shorts,2020,1195.763
7,9,Shorts,2020,2041.784
8,10,Shorts,2020,1801.932
9,11,Shorts,2020,2494.292


# HAVING

**Task 16**

Using an SQL query, retrieve the total views for each month from the ytdata table, including only months where the monthly total exceeds 10,000 thousand views.

Columns to include: month and views.

Name the views column total_views.

Filter to apply: total views.

In [None]:
pd.read_sql_query("""Select publish_month, sum(view) as total_views
                   from ytdata
                   group by publish_month
                   having sum(view) > 10000""",conn)

Unnamed: 0,publish_month,total_views
0,4,11387.747
1,8,14388.547
2,10,11185.122
3,11,15825.545
4,12,12370.71


**Task 17**

Using an SQL query, retrieve the total impressions for each year from the ytdata table, including only years where the yearly total is less than 1,000 thousand impressions.

Columns to include: year and impression.

Name the impression column total_impression.

In [None]:
pd.read_sql_query("""Select publish_year, sum(impression) as total_impression
                   from ytdata
                   group by publish_year
                   having sum(impression) < 1000""",conn)

Unnamed: 0,publish_year,total_impression
0,2015,973.896
1,2022,51.515
2,2023,581.484


**Task 18**


Using an SQL query, retrieve the total video duration for each year from the ytdata table, for videos of type 'Long' only.

Include only years where the yearly total video duration is less than 20000 seconds.

Columns to include: publish year and video duration.

Name the video duration column video_duration_in_sec.

In [None]:
pd.read_sql_query("Select * from ytdata limit 10", conn)

Unnamed: 0,id,title,publish_day,publish_month,publish_year,video_type,duration_seconds,view,like,comment,impression,revenue
0,911,Life of a Bollywood Stuntman,24,4,2021,Shorts,861,847.173,28.765,1.289,1541.855,69.383
1,912,Ultimate Crorepati Hide And Seek Challenge,18,12,2021,Long,729,320.902,21.252,0.736,1399.132,62.961
2,913,Opening a Giant Mystery Box Worth 50 Lakhs,3,4,2021,Long,709,1017.456,31.108,1.628,671.521,30.218
3,914,Bank Heist Challenge: Winner Takes 10 Lakhs,26,9,2021,Long,482,500.089,23.596,1.206,1600.286,72.013
4,915,Press for a Chance to Win 10 Lakhs!,14,11,2020,Shorts,911,1455.27,27.624,0.839,6941.64,312.374
5,916,50 Hours Locked in Tihar Jail,26,6,2021,Long,754,794.648,22.14,0.949,1517.778,68.3
6,917,50 Hours Buried in a Sand Dune Challenge,27,3,2021,Shorts,760,1543.227,53.79,2.336,6373.526,286.809
7,918,A Day in an Indian Bunker,11,4,2020,Long,719,1027.955,20.804,0.778,2168.984,97.604
8,919,My Sister's Wedding Tour,10,10,2020,Shorts,625,950.797,26.124,1.163,4326.127,194.676
9,920,CID Chase Challenge,7,8,2021,Long,1000,719.939,25.583,0.89,1115.906,50.216


In [None]:
pd.read_sql_query("""Select publish_year, sum(duration_seconds) as video_duration_in_sec
                   from ytdata
                   where video_type = 'Long'
                   group by publish_year
                   having sum(duration_seconds) < 20000
                   """,conn)

Unnamed: 0,publish_year,video_duration_in_sec
0,2015,1631
1,2016,1216
2,2020,14966
3,2021,16841
4,2022,204
5,2023,1475
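
Task 18 combines both kinds of filter: `WHERE` removes rows before grouping, and `HAVING` removes whole groups after the sums are computed. A minimal sketch of why the placement matters, on hypothetical toy rows (the durations below are invented for illustration, not taken from the dataset):

```python
import sqlite3

# Hypothetical toy rows: (publish_year, video_type, duration_seconds).
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE ytdata (publish_year INTEGER, video_type TEXT, duration_seconds INTEGER)"
)
toy_conn.executemany(
    "INSERT INTO ytdata VALUES (?, ?, ?)",
    [
        (2020, "Long", 15000),
        (2020, "Long", 4000),
        (2020, "Shorts", 9000),
        (2021, "Long", 12000),
        (2021, "Shorts", 30000),
        (2022, "Long", 25000),
    ],
)

# WHERE keeps only Long videos, then HAVING keeps years whose Long total is small.
long_only = toy_conn.execute(
    """SELECT publish_year, SUM(duration_seconds) AS video_duration_in_sec
       FROM ytdata
       WHERE video_type = 'Long'
       GROUP BY publish_year
       HAVING SUM(duration_seconds) < 20000
       ORDER BY publish_year"""
).fetchall()

# Without the WHERE clause, Shorts count toward each yearly total as well.
all_types = toy_conn.execute(
    """SELECT publish_year, SUM(duration_seconds) AS video_duration_in_sec
       FROM ytdata
       GROUP BY publish_year
       HAVING SUM(duration_seconds) < 20000
       ORDER BY publish_year"""
).fetchall()

print(long_only)  # [(2020, 19000), (2021, 12000)]
print(all_types)  # []
```

A condition on an individual row (such as the video type) belongs in `WHERE`; a condition on an aggregate (such as the yearly sum) can only go in `HAVING`.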
