### Merge Match Data

This notebook merges the validated screenplay metadata with the character-matching statistics, joining the two tables on IMDb ID.

In [1]:
import pandas as pd
import os
import json
import re
import csv
import matplotlib.pyplot as plt
import matplotlib.pylab as pl
import matplotlib.gridspec as gridspec

In [2]:
METADATA_DIR = "../../data/8_screenplays"

df_meta = pd.read_csv(f'{METADATA_DIR}/1_validation/clean_validated.csv', dtype={'imdb_id': str, 'id_merged': str})
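
The `dtype` argument above matters because IMDb IDs can carry leading zeros, which pandas would drop if it inferred the column as an integer. A minimal sketch of the difference, using a hypothetical in-memory two-row CSV (the rows and ID padding are illustrative, not read from `clean_validated.csv`):

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real metadata file.
raw = "imdb_id,title\n0031725,Ninotchka\n0417385,12 and Holding\n"

# Default parsing infers an integer column and loses the leading zeros.
as_default = pd.read_csv(io.StringIO(raw))

# Forcing str keeps the IDs exactly as written, so they can serve as join keys.
as_string = pd.read_csv(io.StringIO(raw), dtype={"imdb_id": str})

print(as_default["imdb_id"].tolist())
print(as_string["imdb_id"].tolist())
```

Both tables in a merge need the same key dtype, so reading every ID column as `str` keeps the join keys comparable.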

In [3]:
df_meta = df_meta.drop(['match', 'alt_id', 'notes'], axis=1)

In [4]:
df_meta.head()

,imdb_id,title,script_url,filename,id_merged,char_fname
0,417385,12 and Holding,https://imsdb.com/scripts/12-and-Holding.html,12-and-Holding,417385,12-and-Holding_charinfo.txt
1,2024544,12 Years a Slave,https://imsdb.com/scripts/12-Years-a-Slave.html,12-Years-a-Slave,2024544,12-Years-a-Slave_charinfo.txt
2,1542344,127 Hours,https://imsdb.com/scripts/127-Hours.html,127-Hours,1542344,127-Hours_charinfo.txt
3,179626,15 Minutes,https://imsdb.com/scripts/15-Minutes.html,15-Minutes,179626,15-Minutes_charinfo.txt
4,974661,17 Again,https://imsdb.com/scripts/17-Again.html,17-Again,974661,17-Again_charinfo.txt


Next, we load the most recent subset of movies in the pipeline: those with "high coverage and matching".

In [6]:
df_stats = pd.read_csv(f'{METADATA_DIR}/3_character_matching/movies_with_high_coverage_and_matching/imdb_and_stats.csv', dtype={'imdb': str})

In [7]:
df_stats.head()

,imdb,high_matches,avg_score,coverage
0,21214,0.545455,70.818182,0.972222
1,22054,0.521739,71.478261,0.959559
2,22958,0.75,86.944444,0.977011
3,24216,0.514286,74.514286,0.970219
4,25878,0.714286,82.632653,0.977593


In [8]:
df_stats.shape

(1367, 4)

In [9]:
# Inner join on IMDb ID, then drop the ID columns that duplicate 'imdb'
stats_meta = df_stats.merge(df_meta, left_on='imdb', right_on='id_merged').drop(['id_merged', 'imdb_id'], axis=1)

In [10]:
stats_meta.head()

,imdb,high_matches,avg_score,coverage,title,script_url,filename,char_fname
0,21214,0.545455,70.818182,0.972222,One Good Turn,https://sfy.ru/?script=onegoodturn,One-Good-Turn,One-Good-Turn_charinfo.txt
1,22054,0.521739,71.478261,0.959559,The Last Flight of Noah's Ark,https://imsdb.com/scripts/Last-Flight%2C-The.html,Last-Flight-The,Last-Flight-The_charinfo.txt
2,22958,0.75,86.944444,0.977011,The Grand Budapest Hotel,https://sfy.ru/?script=grand_hotel_1932,Grand-Hotel,Grand-Hotel_charinfo.txt
3,24216,0.514286,74.514286,0.970219,King Kong,https://www.dailyscript.com/scripts/kong1933.html,Kong-(King-Kong),Kong-(King-Kong)_charinfo.txt
4,25878,0.714286,82.632653,0.977593,The Thin Man,https://www.dailyscript.com/scripts/thethinman...,The-Thin-Man,The-Thin-Man_charinfo.txt


In [11]:
stats_meta.shape

(1366, 8)
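
The merged table has 1366 rows against 1367 in `df_stats`, so the default inner join dropped at least one movie that has no matching `id_merged`. One way to find such rows is a left join with `indicator=True`. The sketch below uses toy frames with made-up IDs and titles (`toy_stats` and `toy_meta` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for df_stats and df_meta; all values are made up.
toy_stats = pd.DataFrame(
    {"imdb": ["0000001", "0000002", "0000003"], "coverage": [0.97, 0.96, 0.98]}
)
toy_meta = pd.DataFrame(
    {"id_merged": ["0000001", "0000003"], "title": ["A", "C"]}
)

# A left join keeps every stats row; '_merge' records where each row was found.
checked = toy_stats.merge(
    toy_meta, left_on="imdb", right_on="id_merged", how="left", indicator=True
)

# Rows marked 'left_only' have no metadata and would vanish in an inner join.
unmatched = checked.loc[checked["_merge"] == "left_only", "imdb"].tolist()
print(unmatched)
```

Running the same check on `df_stats` and `df_meta` should list the IDs missing from the metadata. Note that a net difference of one row could also hide duplicate keys on either side, so checking `id_merged` for duplicates is a reasonable companion step.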

In [12]:
stats_meta[stats_meta['high_matches'] > 0.8]

,imdb,high_matches,avg_score,coverage,title,script_url,filename,char_fname
8,0031725,0.967742,92.322581,0.987118,Ninotchka,https://imsdb.com/scripts/Ninotchka.html,Ninotchka,Ninotchka_charinfo.txt
11,0032551,0.919355,92.887097,0.971596,The Grapes of Wrath,https://imsdb.com/scripts/Grapes-of-Wrath%2C-T...,Grapes-of-Wrath-The,Grapes-of-Wrath-The_charinfo.txt
19,0036104,0.857143,87.942857,0.958199,The Leopard Man,https://sfy.ru/?script=leopard_man,Leopard-Man-The,Leopard-Man-The_charinfo.txt
22,0036613,0.900000,89.100000,0.996859,Arsenic and Old Lace,https://imsdb.com/scripts/Arsenic-and-Old-Lace...,Arsenic-and-Old-Lace,Arsenic-and-Old-Lace_charinfo.txt
24,0036775,0.954545,89.863636,0.992388,Double Indemnity,https://sfy.ru/?script=double_indemnity_1944,Double-Indemnity,Double-Indemnity_charinfo.txt
...,...,...,...,...,...,...,...,...
1356,8579674,0.971429,97.057143,0.952224,1917,https://script-pdf.s3-us-west-2.amazonaws.com/...,1917,1917_charinfo.txt
1358,8722346,0.846154,92.384615,0.989796,Queen & Slim,https://script-pdf.s3-us-west-2.amazonaws.com/...,Queen-and-Slim,Queen-and-Slim_charinfo.txt
1360,8946378,0.913043,89.869565,0.943590,Knives Out,https://script-pdf.s3-us-west-2.amazonaws.com/...,Knives-Out,Knives-Out_charinfo.txt
1361,9620292,0.888889,95.296296,0.972749,Promising Young Woman,https://script-pdf.s3-us-west-2.amazonaws.com/...,Promising-Young-Woman,Promising-Young-Woman_charinfo.txt


In [None]:
stats_meta.to_csv(f'{METADATA_DIR}/3_character_matching/movies_with_high_coverage_and_matching/meta_imdb_and_stats.csv', index=False)