<a href="https://colab.research.google.com/github/rahul-rohilla1/IIT-Work/blob/main/TDS_W6_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import and Data Loading

In [None]:
import pandas as pd
import numpy as np
from itertools import permutations

The dataset used here is available on [data.world](https://data.world/adrianmcmahon/imdb-dataset-all-indian-movies).

The raw data was then preprocessed by:
- removing unnecessary columns
- removing rows with a missing Year
- keeping only movies released in 2005 or later
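
The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration on a toy frame, not the exact code used to produce the cleaned file: the `Rating` column and all film and actor values are hypothetical, and it assumes `Year` is already numeric.

```python
import pandas as pd

# Toy stand-in for the raw data; "Rating" and all values are hypothetical.
raw = pd.DataFrame({
    "Name": ["Film A", "Film B", "Film C"],
    "Year": [2004, None, 2019],
    "Rating": [6.1, 7.0, 5.5],
    "Actor1": ["P", "Q", "R"],
    "Actor2": ["S", "T", "U"],
    "Actor3": ["V", "W", "X"],
})

keep = ["Name", "Year", "Actor1", "Actor2", "Actor3"]
cleaned = raw[keep]                          # remove unnecessary columns
cleaned = cleaned.dropna(subset=["Year"])    # remove rows with Year missing
cleaned = cleaned[cleaned["Year"] >= 2005]   # keep movies released in 2005 or later
cleaned = cleaned.astype({"Year": int}).reset_index(drop=True)
print(cleaned)
```

Only the toy row for "Film C" survives all three filters here.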


**The result is the following cleaned dataset:**


In [None]:
# Cleaned Data
!gdown 1iF1HSol0S2Wc81ifHCDv4ylsIM89dXMv

Downloading...
From: https://drive.google.com/uc?id=1iF1HSol0S2Wc81ifHCDv4ylsIM89dXMv
To: /content/Bollywood_Movies_Cleaned.csv
  0% 0.00/255k [00:00<?, ?B/s]100% 255k/255k [00:00<00:00, 101MB/s]


In [None]:
# Paste your dataset path here.
df = pd.read_csv(r"/content/Bollywood_Movies_Cleaned.csv")
df.drop_duplicates(inplace=True)
df

Unnamed: 0,Name,Year,Actor1,Actor2,Actor3
0,#Gadhvi (He thought he was Gandhi),2019,Rasika Dugal,Vivek Ghamande,Arvind Jangid
1,#Homecoming,2021,Sayani Gupta,Plabita Borthakur,Roy Angana
2,#Yaaram,2019,Prateik,Ishita Raj,Siddhant Kapoor
3,...And Once Again,2010,Rajat Kapoor,Rituparna Sengupta,Antara Mali
4,...Yahaan,2005,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
...,...,...,...,...,...
4088,Zinga,2021,Sri Ram,Devan Sanjeev,Kasu Naveen Kumar
4089,Zokkomon,2011,Darsheel Safary,Anupam Kher,Manjari Fadnnis
4090,Zoo,2018,Shashank Arora,Prince Daniel,Shatakshi Gupta
4091,Zor Lagaa Ke... Haiya!,2009,Meghan Jadhav,Mithun Chakraborty,Riya Sen


## Permutation of Actors

Actor1, Actor2, and Actor3 of a movie are **directly linked** to each other because they worked together in that movie.

What we need is every **permutation** (ordered pair) of directly linked actors. A movie with 3 actors gives **<sup>3</sup>P<sub>2</sub> = 6** such pairs.

Hence, each movie with 3 actors contributes 6 entries, so every connection is recorded in both directions.
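
As a quick check of the count, the sketch below enumerates the ordered pairs for a single movie, using the three actors from the first row of the dataset. The variable names are illustrative only.

```python
from itertools import permutations
from math import perm

actors = ["Rasika Dugal", "Vivek Ghamande", "Arvind Jangid"]

# All ordered pairs of distinct actors from one movie
pairs = list(permutations(actors, 2))
n_pairs = len(pairs)

# Every link appears once in each direction: (a, b) and (b, a)
both_directions = all((b, a) in pairs for a, b in pairs)

print(n_pairs, perm(3, 2), both_directions)
```

Because both directions are present, each actor can be looked up in the `From` column to find all of their co-stars.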

In [None]:
all_permutations = []
for row in df.itertuples():
    # film_actors is [Actor1, Actor2, Actor3]
    film_actors = [row.Actor1, row.Actor2, row.Actor3]
    # Take all 3P2 ordered pairs of the film's actors
    all_permutations.extend(permutations(film_actors, 2))
all_permutations[:10]

[('Rasika Dugal', 'Vivek Ghamande'),
 ('Rasika Dugal', 'Arvind Jangid'),
 ('Vivek Ghamande', 'Rasika Dugal'),
 ('Vivek Ghamande', 'Arvind Jangid'),
 ('Arvind Jangid', 'Rasika Dugal'),
 ('Arvind Jangid', 'Vivek Ghamande'),
 ('Sayani Gupta', 'Plabita Borthakur'),
 ('Sayani Gupta', 'Roy Angana'),
 ('Plabita Borthakur', 'Sayani Gupta'),
 ('Plabita Borthakur', 'Roy Angana')]

In [None]:
connections = pd.DataFrame(all_permutations, columns=['From', 'To'])

In [None]:
connections

Unnamed: 0,From,To
0,Rasika Dugal,Vivek Ghamande
1,Rasika Dugal,Arvind Jangid
2,Vivek Ghamande,Rasika Dugal
3,Vivek Ghamande,Arvind Jangid
4,Arvind Jangid,Rasika Dugal
...,...,...
24553,Vicky Kaushal,Raaghavv Chanana
24554,Sarah Jane Dias,Vicky Kaushal
24555,Sarah Jane Dias,Raaghavv Chanana
24556,Raaghavv Chanana,Vicky Kaushal


## Removing NaNs and Duplicates


In [None]:
# Movies that did not have three actors produce NaNs in their permutations
print("NaNs in Column:\n From\t To\n", connections.From.isna().sum(), '\t', connections.To.isna().sum())
print('\nSize:', connections.shape)

NaNs in Column:
 From	 To
 384 	 384

Size: (24558, 2)


In [None]:
connections.dropna(inplace=True)
print('Size after removing NaNs', connections.shape)

Size after removing NaNs (23790, 2)
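
To see where the NaNs come from, here is a small sketch with one hypothetical movie that lists only two actors (the names `Actor A` and `Actor B` are made up). Its six permutations include four pairs with a missing side, and `dropna` keeps only the two valid ones.

```python
from itertools import permutations

import numpy as np
import pandas as pd

# Hypothetical movie with its third actor missing
pairs = list(permutations(["Actor A", "Actor B", np.nan], 2))
toy = pd.DataFrame(pairs, columns=["From", "To"])

nan_from = int(toy["From"].isna().sum())  # pairs that start at the missing actor
nan_to = int(toy["To"].isna().sum())      # pairs that end at the missing actor
remaining = toy.dropna()                  # only A -> B and B -> A survive

print(nan_from, nan_to, len(remaining))
```

This matches the counts above: equal numbers of NaNs in `From` and `To` (384 each), and 24558 - 23790 = 768 rows dropped, which is what one would expect if each affected movie is missing exactly one actor.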


## Submission Files for Kumu


### With Strength

In [None]:
connections_w_strength = connections.copy()
# Each row counts as one shared movie; summing per pair gives the link strength
connections_w_strength['Strength'] = np.ones(connections.shape[0])
connections_w_strength = (
    connections_w_strength
    .groupby(['From', 'To'])
    .sum()
    .reset_index()
    .sort_values('Strength', ascending=False)
)
# mode='x' raises FileExistsError if the file already exists, so nothing is overwritten
connections_w_strength.to_csv('connections_w_strength.csv', index=False, mode='x')
connections_w_strength

Unnamed: 0,From,To,Strength
8692,Kavita Joshi,Uttar Kumar,12.0
21135,Uttar Kumar,Kavita Joshi,12.0
21129,Uttar Kumar,Dev Sharma,7.0
8623,Katrina Kaif,Akshay Kumar,7.0
1148,Akshay Kumar,Katrina Kaif,7.0
...,...,...,...
7677,Jaya Bachchan,Om Puri,1.0
7676,Jaya Bachchan,Mallika Sarabhai,1.0
7675,Jaya Bachchan,Madhavan,1.0
7674,Jaya Bachchan,Konkona Sen Sharma,1.0
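
The strength of a link is simply the number of times an ordered pair appears, that is, the number of movies the two actors share. As an alternative sketch (not the approach used in the cell above), the same count can be obtained with `groupby(...).size()`, shown here on hypothetical toy pairs. Note that this variant produces integer counts rather than floats.

```python
import pandas as pd

# Hypothetical pairs: A and B share two movies, A and C share one
toy = pd.DataFrame({
    "From": ["A", "B", "A", "B", "A", "C"],
    "To":   ["B", "A", "B", "A", "C", "A"],
})

strength = (
    toy.groupby(["From", "To"])
    .size()                          # number of rows per ordered pair
    .reset_index(name="Strength")
    .sort_values("Strength", ascending=False)
)
print(strength)
```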


### Without Strength

In [None]:
# Keep a single row per ordered pair of actors
connections.drop_duplicates(inplace=True)
# mode='x' raises FileExistsError if the file already exists, so nothing is overwritten
connections.to_csv('connections.csv', index=False, mode='x')
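
Because of `mode='x'`, re-running the export cells fails once the CSV files exist. The sketch below illustrates that behaviour in a temporary directory with a hypothetical one-row frame. It assumes a pandas version whose `to_csv` accepts `mode='x'`, as the cells above do.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"From": ["A"], "To": ["B"]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "connections.csv")
    toy.to_csv(path, index=False, mode="x")       # first write creates the file
    try:
        toy.to_csv(path, index=False, mode="x")   # second write must not overwrite
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True

print(second_write_failed)
```

To regenerate the files, delete the existing CSVs first or switch to the default `mode='w'`.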