## SQL Assignment

Using Jupyter magics with SQL

**Dependencies:**

* pip install ipython-sql

In [1]:
#loading SQL module
%load_ext sql

#connect to the database
%sql sqlite:///Db-IMDB-Assignment.db

'Connected: @Db-IMDB-Assignment.db'

### IMDB-Data-Cleaning

In the tables required for the assignment, the "year" column contains unwanted Roman numerals, and the name, PID, and MID columns contain stray spaces. You can optionally clean these up before running the queries.
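
A minimal sketch of what such a cleanup could look like, run through Python's built-in `sqlite3` module against a tiny in-memory stand-in table rather than the assignment database. The sample rows, the assumption that the Roman numerals appear as a prefix before a four-digit year (for example `I 2016`), and the choice to keep only the last four characters are illustrative assumptions, not taken from the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Columns are left untyped so the cleaned year is stored as an integer
conn.execute('CREATE TABLE Movie (MID, title, "year")')
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?)",
    [
        (" tt001 ", "Movie A", "I 2016"),    # hypothetical rows
        ("tt002", "Movie B", "2004"),
        ("tt003 ", "Movie C", "XVII 2012"),
    ],
)

# Keep the last four characters of the year and store them as an integer
conn.execute('UPDATE Movie SET "year" = CAST(SUBSTR(TRIM("year"), -4) AS INTEGER)')
# Remove leading and trailing spaces from the key column
conn.execute("UPDATE Movie SET MID = TRIM(MID)")

cleaned = conn.execute('SELECT MID, "year" FROM Movie ORDER BY MID').fetchall()
print(cleaned)
```

The same `TRIM` update would apply to the PID and name columns of the other tables. Once the keys are cleaned in place, the `TRIM(...)` calls inside the join conditions become optional.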

### Q1 --- List all the directors who directed a 'Comedy' movie in a leap year. (You need to check that the genre is 'Comedy' and year is a leap year) Your query should return director name, the movie name, and the year.

In [2]:
%%time
%%sql

SELECT P.Name AS 'Director', M.title, M."year"
FROM Person P 
JOIN M_Director MD ON MD.PID = P.PID 
JOIN Movie M ON M.MID = MD.MID 
JOIN M_Genre MG ON MG.MID = M.MID
JOIN Genre G ON G.GID = MG.GID 
WHERE G.Name LIKE '%Comedy%'
AND ((M."year" % 4 = 0 AND M."year" % 100 <> 0) OR M."year" % 400 = 0)
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 93.7 ms


Director,title,year
Milap Zaveri,Mastizaade,2016
Danny Leiner,Harold & Kumar Go to White Castle,2004
Anurag Kashyap,Gangs of Wasseypur,2012
Frank Coraci,Around the World in 80 Days,2004
Griffin Dunne,The Accidental Husband,2008
Anurag Basu,Barfi!,2012
Gurinder Chadha,Bride & Prejudice,2004
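
Under the Gregorian rule, a year is a leap year if it is divisible by 4, except for century years, which must also be divisible by 400. A plain `% 4 = 0` test therefore treats 1900 as a leap year. The sketch below evaluates the full predicate in SQLite through Python's `sqlite3` module; the table `Y` and the sample years are hypothetical and chosen only to exercise the edge cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Y (yr INTEGER)")
conn.executemany(
    "INSERT INTO Y VALUES (?)",
    [(1900,), (2000,), (2004,), (2015,), (2016,)],
)

# Full Gregorian leap-year test: 1900 is rejected, 2000 is accepted
leap_years = [
    row[0]
    for row in conn.execute(
        """
        SELECT yr FROM Y
        WHERE (yr % 4 = 0 AND yr % 100 <> 0) OR yr % 400 = 0
        ORDER BY yr
        """
    )
]
print(leap_years)
```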


### Q2 --- List the names of all the actors who played in the movie 'Anand' (1971)

In [3]:
%%time
%%sql

SELECT P.Name
FROM Person P 
JOIN M_Cast MC ON P.PID = TRIM(MC.PID) 
JOIN Movie M ON M.MID = MC.MID 
WHERE M.title = 'Anand' AND M."year" = 1971
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 235 ms


Name
Amitabh Bachchan
Rajesh Khanna
Brahm Bhardwaj
Ramesh Deo
Seema Deo
Dev Kishan
Durga Khote


### Q3 --- List all the actors who acted in a film before 1970 and in a film after 1990. (That is: < 1970 and > 1990.)

In [4]:
%%time
%%sql

WITH 
CAST_ON_1970 AS (SELECT TRIM(P.PID) FROM Person P JOIN M_Cast MC ON P.PID = TRIM(MC.PID) JOIN Movie M ON M.MID = MC.MID WHERE M."year" < 1970),
CAST_ON_1990 AS (SELECT TRIM(P.PID) FROM Person P JOIN M_Cast MC ON P.PID = TRIM(MC.PID) JOIN Movie M ON M.MID = MC.MID WHERE M."year" > 1990)

SELECT P.Name AS 'Actor'
FROM Person P
WHERE TRIM(P.PID) IN CAST_ON_1970 AND TRIM(P.PID) IN CAST_ON_1990
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 670 ms


Actor
Rishi Kapoor
Amitabh Bachchan
Asrani
Zohra Sehgal
Parikshat Sahni
Rakesh Sharma
Sanjay Dutt


### Q4 --- List all directors who directed 10 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed.

In [5]:
%%time
%%sql

WITH 
DIRECTOR_MOVIES AS (SELECT MD.PID, COUNT(*) AS Movie_Count FROM M_Director MD GROUP BY MD.PID HAVING COUNT(*) > 10)

SELECT P.Name, DM.Movie_Count 
FROM Person P 
JOIN DIRECTOR_MOVIES DM ON P.PID = DM.PID
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 7.98 ms


Name,Movie_Count
Mahesh Manjrekar,15
Satish Kaushik,12
Anurag Kashyap,13
Yash Chopra,21
Subhash Ghai,18
Rakesh Roshan,13
Madhur Bhandarkar,12
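
The question asks for directors with 10 movies or more, sorted by movie count in descending order, whereas `HAVING COUNT(*) > 10` drops directors with exactly 10 and the result above is unsorted. A minimal sketch of a variant that matches the wording, run on hypothetical toy data through Python's `sqlite3` module (the director names and counts are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PID TEXT, Name TEXT);
    CREATE TABLE M_Director (MID TEXT, PID TEXT);
""")

# Hypothetical directors with 9, 10 and 12 movies
toy = {"d1": ("Director A", 9), "d2": ("Director B", 10), "d3": ("Director C", 12)}
for pid, (name, n) in toy.items():
    conn.execute("INSERT INTO Person VALUES (?, ?)", (pid, name))
    conn.executemany(
        "INSERT INTO M_Director VALUES (?, ?)",
        [(f"{pid}-m{i}", pid) for i in range(n)],
    )

rows = conn.execute("""
    SELECT P.Name, COUNT(*) AS Movie_Count
    FROM M_Director MD
    JOIN Person P ON P.PID = TRIM(MD.PID)
    GROUP BY P.PID, P.Name
    HAVING COUNT(*) >= 10          -- "10 movies or more"
    ORDER BY Movie_Count DESC      -- largest count first
""").fetchall()
print(rows)
```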


### Q5.a --- For each year, count the number of movies in that year that had only female actors.

In [6]:
%%time
%%sql

-- Find every movie with a male cast member, then exclude those movies to keep the female-only casts

WITH 
NON_FEMALE_MOVIES AS (SELECT TRIM(MC.MID) FROM M_Cast MC INNER JOIN Person P ON P.PID = TRIM(MC.PID) WHERE P.Gender in ('Male', NULL) GROUP BY MC.MID),
FEMALE_MOVIES AS (SELECT M.MID FROM Movie M INNER JOIN M_Cast MC ON TRIM(MC.MID) = M.MID WHERE TRIM(M.MID) NOT IN NON_FEMALE_MOVIES AND MC.PID NOTNULL GROUP BY M.MID)


SELECT M."year", COUNT(*) AS 'count'
FROM Movie M
WHERE TRIM(M.MID) IN FEMALE_MOVIES
GROUP BY M."year"
ORDER BY M."year"
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 380 ms


year,count
1939,1
1999,1
2000,1
2018,2
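
One detail worth checking in this approach: in SQL, `Gender IN ('Male', NULL)` never matches rows whose gender is NULL, because a comparison with NULL is never true. Actors of unknown gender are therefore not excluded by that filter. The sketch below shows this on hypothetical toy data and uses `NOT EXISTS` with an explicit `IS NULL` test instead. Treating unknown gender as "not female" is an interpretation of the question, not something the assignment states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (MID TEXT, "year" INTEGER);
    CREATE TABLE Person (PID TEXT, Gender TEXT);
    CREATE TABLE M_Cast (MID TEXT, PID TEXT);
    INSERT INTO Movie VALUES ('m1', 2000), ('m2', 2000), ('m3', 2001);
    INSERT INTO Person VALUES ('p1', 'Female'), ('p2', 'Male'), ('p3', NULL);
    INSERT INTO M_Cast VALUES ('m1', 'p1'),
                              ('m2', 'p1'), ('m2', 'p2'),
                              ('m3', 'p1'), ('m3', 'p3');
""")

# The NULL inside the IN list matches nothing: only the 'Male' row comes back
in_list = conn.execute(
    "SELECT PID FROM Person WHERE Gender IN ('Male', NULL)"
).fetchall()

# Keep movies that have a cast and no cast member who is male or unknown
rows = conn.execute("""
    SELECT M."year", COUNT(*) AS cnt
    FROM Movie M
    WHERE EXISTS (SELECT 1 FROM M_Cast MC WHERE TRIM(MC.MID) = M.MID)
      AND NOT EXISTS (
          SELECT 1
          FROM M_Cast MC
          JOIN Person P ON P.PID = TRIM(MC.PID)
          WHERE TRIM(MC.MID) = M.MID
            AND (P.Gender IS NULL OR P.Gender <> 'Female')
      )
    GROUP BY M."year"
    ORDER BY M."year"
""").fetchall()
print(in_list, rows)
```

Here only `m1` counts as female-only: `m2` has a male actor and `m3` has an actor of unknown gender.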


### Q5.b --- Now include a small change: report for each year the percentage of movies in that year with only female actors, and the total number of movies made that year. For example, one answer will be: 1990 31.81 13522 meaning that in 1990 there were 13,522 movies, and 31.81% had only female actors. You do not need to round your answer.

In [7]:
%%time
%%sql

-- 1. Find every movie with a male cast member, then exclude those movies to keep the female-only casts
-- 2. Compare the female-only movie count with the total number of movies made that year, as a percentage

WITH 
NON_FEMALE_MOVIES AS (SELECT TRIM(MC.MID) FROM M_Cast MC INNER JOIN Person P ON P.PID = TRIM(MC.PID) WHERE P.Gender in ('Male', NULL) GROUP BY MC.MID),
FEMALE_MOVIES AS (SELECT M.MID FROM Movie M INNER JOIN M_Cast MC ON TRIM(MC.MID) = M.MID WHERE TRIM(M.MID) NOT IN NON_FEMALE_MOVIES AND MC.PID NOTNULL GROUP BY M.MID),
ALL_YEARS AS (SELECT M."year", COUNT(*) AS 'total_movies' FROM Movie as M GROUP BY M."year")

SELECT M."year", COUNT(M."year") AS female_movies, AY.total_movies, COUNT(M."year") * 100 / AY.total_movies AS percent
FROM Movie M
INNER JOIN FEMALE_MOVIES FM ON FM.MID = M.MID
INNER JOIN ALL_YEARS AY ON M."year" = AY."year"
GROUP BY M."year"
ORDER BY M."year"
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 390 ms


year,female_movies,total_movies,percent
1939,1,2,50
1999,1,66,1
2000,1,64,1
2018,2,104,1
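
Because both operands are integers, `COUNT(...) * 100 / total_movies` is truncated by SQLite, which is why 1 movie out of 66 shows as 1 rather than roughly 1.52. The inner join also leaves out years that have no female-only movies. A minimal sketch that addresses both points, assuming a hypothetical `Female_Movies` table standing in for the `FEMALE_MOVIES` CTE and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (MID TEXT, "year" INTEGER);
    -- Hypothetical stand-in for the FEMALE_MOVIES CTE
    CREATE TABLE Female_Movies (MID TEXT);
    INSERT INTO Movie VALUES ('m1', 1999), ('m2', 1999), ('m3', 1999), ('m4', 2000);
    INSERT INTO Female_Movies VALUES ('m1');
""")

# Integer arithmetic truncates; a float literal keeps the fraction
truncated = conn.execute("SELECT 1 * 100 / 66").fetchone()[0]
exact = conn.execute("SELECT 1 * 100.0 / 66").fetchone()[0]

# LEFT JOIN keeps years with zero female-only movies (percent 0.0)
rows = conn.execute("""
    SELECT M."year",
           COUNT(FM.MID) * 100.0 / COUNT(*) AS percent,
           COUNT(*) AS total_movies
    FROM Movie M
    LEFT JOIN Female_Movies FM ON FM.MID = M.MID
    GROUP BY M."year"
    ORDER BY M."year"
""").fetchall()
print(truncated, exact, rows)
```

`COUNT(FM.MID)` counts only the rows where the left join found a match, so it gives the female-only count per year, while `COUNT(*)` gives the yearly total.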


### Q6 --- Find the film(s) with the largest cast. Return the movie title and the size of the cast. By "cast size" we mean the number of distinct actors that played in that movie: if an actor played multiple roles, or if it simply occurs multiple times in casts, we still count her/him only once.

In [8]:
%%time
%%sql

WITH 
LARGER_CAST AS (SELECT COUNT(*) AS CAST_COUNT, MID FROM M_Cast GROUP BY MID)

SELECT LC.MID, M.TITLE, MAX(LC.CAST_COUNT) AS CAST_SIZE
FROM LARGER_CAST LC
JOIN Movie M ON M.MID = LC.MID 
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 65.8 ms


MID,title,CAST_SIZE
tt5164214,Ocean's Eight,238
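
The question defines cast size as the number of distinct actors and asks for all films that share the maximum. `COUNT(*)` counts repeated cast rows, and selecting `MAX(...)` alongside bare columns returns a single row in SQLite. A sketch of a variant using `COUNT(DISTINCT ...)` and a comparison against the maximum, on hypothetical toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (MID TEXT, title TEXT);
    CREATE TABLE M_Cast (MID TEXT, PID TEXT);
    INSERT INTO Movie VALUES ('m1', 'Film A'), ('m2', 'Film B'), ('m3', 'Film C');
    -- p1 is listed twice in m1, so m1 has 2 distinct actors, not 3
    INSERT INTO M_Cast VALUES ('m1', 'p1'), ('m1', 'p1'), ('m1', 'p2'),
                              ('m2', 'p1'), ('m2', 'p3'),
                              ('m3', 'p4');
""")

rows = conn.execute("""
    WITH CAST_SIZE AS (
        SELECT TRIM(MID) AS MID, COUNT(DISTINCT TRIM(PID)) AS cast_size
        FROM M_Cast
        GROUP BY TRIM(MID)
    )
    SELECT M.title, CS.cast_size
    FROM CAST_SIZE CS
    JOIN Movie M ON M.MID = CS.MID
    -- Keep every film tied for the largest cast
    WHERE CS.cast_size = (SELECT MAX(cast_size) FROM CAST_SIZE)
    ORDER BY M.title
""").fetchall()
print(rows)
```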


### Q7 --- A decade is a sequence of 10 consecutive years. For example, say in your database you have movie information starting from 1965. Then the first decade is 1965, 1966, ..., 1974; the second one is 1966, 1967, ..., 1975 and so on. Find the decade D with the largest number of films and the total number of films in D.

In [9]:
%%time
%%sql

-- 1. Collect the distinct years from the Movie table
-- 2. Treat each distinct year as the start of a decade (year to year + 9) and count the movies in that window

WITH 
UNIQUE_YEAR AS (SELECT DISTINCT "year" FROM Movie)

SELECT D."year" AS START, D."year"+9 AS END, COUNT(*) AS 'COUNT' 
FROM UNIQUE_YEAR D
JOIN Movie M on M."year" >= START AND M."year"<= END
GROUP BY END 
ORDER BY COUNT(*) DESC 
LIMIT 1

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 114 ms


START,END,COUNT
2008,2017,1205


### Q8 --- Find all the actors that made more movies with Yash Chopra than any other director.

In [10]:
%%time
%%sql

-- Count each actor's movies directed by Yash Chopra, highest count first

SELECT TRIM(P.Name) AS ACTOR_NAME, COUNT(DISTINCT M.MID) AS YASH_CHOPRA_DIRECTED_MOVIES
FROM Person P 
JOIN M_Cast MC ON TRIM(MC.PID) = P.PID 
JOIN Movie M ON M.MID = MC.MID 
JOIN M_Director MD ON MD.MID = M.MID 
JOIN Person P1 ON P1.PID = TRIM(MD.PID)
WHERE TRIM(P1.Name) = 'Yash Chopra'
GROUP BY TRIM(P.PID)
ORDER BY COUNT(DISTINCT M.MID) DESC
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 570 ms


ACTOR_NAME,YASH_CHOPRA_DIRECTED_MOVIES
Jagdish Raj,11
Manmohan Krishna,10
Iftekhar,9
Madan Puri,8
Vikas Anand,8
Anupam Kher,7
Shashi Kapoor,7
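
The ranking above counts each actor's Yash Chopra movies but does not compare that count with the actor's counts for other directors, which the question requires. A minimal sketch of that comparison on hypothetical toy data (the actor names, the second director, and the movie IDs are invented). It reads "more than any other director" as strictly greater than the actor's best count with any single other director.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PID TEXT, Name TEXT);
    CREATE TABLE M_Cast (MID TEXT, PID TEXT);
    CREATE TABLE M_Director (MID TEXT, PID TEXT);
    INSERT INTO Person VALUES ('d1', 'Yash Chopra'), ('d2', 'Other Director'),
                              ('a1', 'Actor A'), ('a2', 'Actor B'), ('a3', 'Actor C');
    INSERT INTO M_Director VALUES ('m1', 'd1'), ('m2', 'd1'),
                                  ('m3', 'd2'), ('m4', 'd2');
    INSERT INTO M_Cast VALUES ('m1', 'a1'), ('m2', 'a1'), ('m3', 'a1'),
                              ('m1', 'a2'), ('m3', 'a2'), ('m4', 'a2'),
                              ('m2', 'a3');
""")

rows = conn.execute("""
    WITH ACTOR_DIRECTOR AS (
        -- Distinct movies per (actor, director) pair
        SELECT TRIM(MC.PID) AS actor, TRIM(MD.PID) AS director,
               COUNT(DISTINCT TRIM(MC.MID)) AS movies
        FROM M_Cast MC
        JOIN M_Director MD ON TRIM(MD.MID) = TRIM(MC.MID)
        GROUP BY TRIM(MC.PID), TRIM(MD.PID)
    ),
    YC AS (
        SELECT AD.actor, AD.movies
        FROM ACTOR_DIRECTOR AD
        JOIN Person D ON D.PID = AD.director
        WHERE TRIM(D.Name) = 'Yash Chopra'
    )
    SELECT P.Name, YC.movies
    FROM YC
    JOIN Person P ON P.PID = YC.actor
    WHERE YC.movies > (
        -- Best count with any single other director (0 if none)
        SELECT COALESCE(MAX(AD.movies), 0)
        FROM ACTOR_DIRECTOR AD
        JOIN Person D ON D.PID = AD.director
        WHERE AD.actor = YC.actor AND TRIM(D.Name) <> 'Yash Chopra'
    )
    ORDER BY YC.movies DESC, P.Name
""").fetchall()
print(rows)
```

In the toy data, Actor B has one movie with Yash Chopra but two with the other director, so only Actor A and Actor C qualify.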


### Q9 --- The Shahrukh number of an actor is the length of the shortest path between the actor and Shahrukh Khan in the "co-acting" graph. That is, Shahrukh Khan has Shahrukh number 0; all actors who acted in the same film as Shahrukh have Shahrukh number 1; all actors who acted in the same film as some actor with Shahrukh number 1 have Shahrukh number 2, etc. Return all actors whose Shahrukh number is 2.

In [11]:
%%time
%%sql

-- 1. ROW_NUMBER() assigns each actor a position within a movie's cast
-- 2. Keep the movies in which Shah Rukh Khan is at position 2, then list their cast

WITH 
Shah_rukh_khan_index AS (SELECT  P.PID, MC.MID, P.Name, ROW_NUMBER() OVER (PARTITION BY MC.MID) AS ROW_NUM FROM  Person P JOIN M_Cast MC ON TRIM(MC.PID) = P.PID)

SELECT MC.MID, M.title, ROW_NUMBER() OVER (PARTITION BY MC.MID) AS 'cast_number', P.Name
FROM Person P 
JOIN M_Cast MC ON TRIM(MC.PID) = P.PID 
JOIN Movie M ON M.MID = TRIM(MC.MID)
WHERE M.MID IN (SELECT MID FROM Shah_rukh_khan_index WHERE Name LIKE '%Shah%Rukh%Khan%' and ROW_NUM = 2)
LIMIT 7;

 * sqlite:///Db-IMDB-Assignment.db
Done.
Wall time: 747 ms


MID,title,cast_number,Name
tt0107321,King Uncle,1,Jackie Shroff
tt0107321,King Uncle,2,Shah Rukh Khan
tt0107321,King Uncle,3,Nagma
tt0107321,King Uncle,4,Sushmita Mukherjee
tt0107321,King Uncle,5,Deb Mukherjee
tt0107321,King Uncle,6,Deven Verma
tt0107321,King Uncle,7,Yunus Parvez
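
The cast position computed above is not the same thing as the Shahrukh number defined in the question, which is a distance in the co-acting graph. A minimal sketch of the two-step expansion the definition describes, on hypothetical toy data: first collect the actors who share a film with Shah Rukh Khan (number 1), then the actors who share a film with any of those, excluding Shah Rukh Khan and the number-1 actors. The actor names other than Shah Rukh Khan and all movie IDs are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PID TEXT, Name TEXT);
    CREATE TABLE M_Cast (MID TEXT, PID TEXT);
    INSERT INTO Person VALUES ('p0', 'Shah Rukh Khan'), ('p1', 'Actor One'),
                              ('p2', 'Actor Two'), ('p3', 'Actor Three'),
                              ('p4', 'Actor Four');
    INSERT INTO M_Cast VALUES ('m1', 'p0'), ('m1', 'p1'),
                              ('m2', 'p1'), ('m2', 'p2'),
                              ('m3', 'p2'), ('m3', 'p3'),
                              ('m5', 'p1'), ('m5', 'p4'),
                              ('m6', 'p0'), ('m6', 'p4');
""")

rows = conn.execute("""
    WITH
    SRK AS (SELECT PID FROM Person WHERE TRIM(Name) = 'Shah Rukh Khan'),
    SRK_MOVIES AS (
        SELECT DISTINCT TRIM(MID) AS MID FROM M_Cast
        WHERE TRIM(PID) IN (SELECT PID FROM SRK)
    ),
    LEVEL_1 AS (      -- co-actors of Shah Rukh Khan
        SELECT DISTINCT TRIM(PID) AS PID FROM M_Cast
        WHERE TRIM(MID) IN (SELECT MID FROM SRK_MOVIES)
          AND TRIM(PID) NOT IN (SELECT PID FROM SRK)
    ),
    LEVEL_1_MOVIES AS (
        SELECT DISTINCT TRIM(MID) AS MID FROM M_Cast
        WHERE TRIM(PID) IN (SELECT PID FROM LEVEL_1)
    ),
    LEVEL_2 AS (      -- co-actors of level-1 actors, minus levels 0 and 1
        SELECT DISTINCT TRIM(PID) AS PID FROM M_Cast
        WHERE TRIM(MID) IN (SELECT MID FROM LEVEL_1_MOVIES)
          AND TRIM(PID) NOT IN (SELECT PID FROM SRK)
          AND TRIM(PID) NOT IN (SELECT PID FROM LEVEL_1)
    )
    SELECT P.Name
    FROM Person P
    JOIN LEVEL_2 L2 ON L2.PID = P.PID
    ORDER BY P.Name
""").fetchall()
print(rows)
```

Actor Four acts with Actor One but also directly with Shah Rukh Khan, so it stays at number 1. Actor Three is only reachable through Actor Two and would have number 3.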
