# `DATA EXPLORATION`

## **TOPIC: FILMS ANALYSIS**

`Group ID`: 17

`Group Members`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

### Read data

In [16]:
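# Read the tab-separated file into a dict mapping column name -> list of string values.
data = {}
with open('films_data.csv', 'r', encoding='utf-8-sig') as file:
    # The first line holds the column names.
    first_line = file.readline().strip().split('\t')
    for val in first_line:
        data[val] = []

    # Every following line describes one film; append each field to its column.
    for line in file:
        line_vals = line.strip().split('\t')
        for i in range(len(line_vals)):
            data[first_line[i]].append(line_vals[i])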
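
The same column-oriented dictionary can also be built with the standard library's `csv` module. The sketch below is only an illustration: `read_tsv_columns` is a hypothetical helper name, the sample rows are made up, and it assumes the file uses no quoting (hence `csv.QUOTE_NONE`). Unlike `strip()`, this version keeps empty trailing fields, so every column ends up with the same length.

```python
import csv
import io


def read_tsv_columns(handle):
    """Read a tab-separated stream into a dict of column name -> list of strings."""
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        if not row:  # skip completely blank lines
            continue
        # Pad short rows so every column keeps the same length.
        row = row + [''] * (len(header) - len(row))
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# Illustrative in-memory sample (made-up rows, not taken from films_data.csv).
sample = io.StringIO("Rank\tTitle\tCast\n1\tFilm A\tActor X, Actor Y\n2\tFilm B\t\n")
columns = read_tsv_columns(sample)
print(columns['Cast'])
```

For the real file, the handle would come from `open('films_data.csv', newline='', encoding='utf-8-sig')`.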

### The meaning of each row
Each row represents a specific movie, detailing information about its release, performance, genre, and key contributors (director, writer, and cast).

In [7]:
print('Number of rows:', len(data['Rank']))

Number of rows: 1000


### The meaning of each column
- `Rank`: The film's rank among the top lifetime grosses.
- `Title`: The film's name.
- `Foreign %`: The share of the film's worldwide gross earned in foreign markets.
- `Domestic %`: The share of the film's worldwide gross earned in the domestic market.
- `Year`: The year the film was first released.
- `Genre`: The genre(s) associated with the film.
- `Director`: The director(s) of the film.
- `Writer`: The writer(s) credited for the film.
- `Cast`: The main cast members of the film.

In [8]:
print('Number of columns:', len(data))

Number of columns: 9


### The datatype of each column

In [9]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')


Rank: <class 'str'>
Title: <class 'str'>
Foreign %: <class 'str'>
Domestic %: <class 'str'>
Year: <class 'str'>
Genre: <class 'str'>
Director: <class 'str'>
Writer: <class 'str'>
Cast: <class 'str'>


### Preprocessing data

- Convert Rank: Remove the thousands separator (',') from Rank and convert it to an integer.

- Convert Percentage Columns: Convert Foreign % and Domestic % to floats by removing the '%' symbol (and, for Domestic %, any leading '<'). A '-' in Domestic % means the film has no domestic gross, so it is treated as 0% and the foreign gross accounts for 100% of the film's worldwide gross.

- Standardize Year Data Type: Ensure Year is an integer for easy numerical analysis.

- Split Genres: Split the values in Genre into separate columns or lists for better analysis of each genre individually.

- Director and Writer Parsing: If needed, split multiple directors or writers into lists to analyze individual contributions.

- Cast Parsing: Similarly, parse the Cast column into individual actor names or convert to lists, which will make it easier to analyze actor appearances across movies.
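
The conversion rules above can be sketched as two small helpers. This is a minimal illustration, not the notebook's own code: `parse_percent` and `split_names` are hypothetical names, and two assumptions are made, namely that '-' maps to 0 in either percentage column and that a value such as '<0.1%' is stored as its upper bound 0.1.

```python
def parse_percent(text):
    """Convert strings such as '50%', '<0.1%' or '-' to a float percentage."""
    text = text.strip()
    if text == '-':  # assumption: '-' means no gross was reported, i.e. 0%
        return 0.0
    # Drop the '<' marker and the trailing '%' before converting.
    return float(text.replace('<', '').rstrip('%'))


def split_names(text):
    """Split a comma-separated field into a list of trimmed, non-empty names."""
    return [name.strip() for name in text.split(',') if name.strip()]


print(parse_percent('50%'))
print(parse_percent('<0.1%'))
print(parse_percent('-'))
print(split_names('Action, Adventure,  Sci-Fi'))
```

Keeping the parsing in named functions makes it easy to apply the same rule to both percentage columns and to reuse the splitting logic for Genre, Director, Writer, and Cast.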

In [10]:
# Remove thousands separators before converting Rank to int.
data['Rank'] = [int(rank.replace(',', '')) for rank in data['Rank']]

# Percentages: drop the trailing '%' and convert to float.
data['Foreign %'] = [float(value.rstrip('%')) for value in data['Foreign %']]

# Domestic % may also carry a leading '<' or be '-' (treated as 0%).
data['Domestic %'] = [
    float(value.replace('<', '').rstrip('%')) if value != '-' else 0.0
    for value in data['Domestic %']
]

data['Year'] = [int(year) for year in data['Year']]

# Multi-valued columns: split the comma-separated strings into lists.
data['Genre'] = [genres.split(', ') for genres in data['Genre']]

data['Director'] = [directors.split(', ') for directors in data['Director']]

data['Writer'] = [writers.split(', ') for writers in data['Writer']]

data['Cast'] = [actors.split(', ') for actors in data['Cast']]


### New datatype of each column

In [11]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')

Rank: <class 'int'>
Title: <class 'str'>
Foreign %: <class 'float'>
Domestic %: <class 'float'>
Year: <class 'int'>
Genre: <class 'list'>
Director: <class 'list'>
Writer: <class 'list'>
Cast: <class 'list'>
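
Because `Foreign %` and `Domestic %` are both shares of the same worldwide gross, the two values in a row should add up to roughly 100 once they are floats. A minimal sketch of that sanity check follows; the helper name `rows_not_summing_to_100`, the tolerance of 0.5, and the sample values are all assumptions made for illustration.

```python
def rows_not_summing_to_100(foreign, domestic, tolerance=0.5):
    """Return indices of rows where foreign + domestic deviates from 100 by more than tolerance."""
    return [
        i for i, (f, d) in enumerate(zip(foreign, domestic))
        if abs(f + d - 100.0) > tolerance
    ]


# Made-up values: the third row is deliberately inconsistent (40 + 50 = 90).
foreign = [62.5, 100.0, 40.0]
domestic = [37.5, 0.0, 50.0]
print(rows_not_summing_to_100(foreign, domestic))  # [2]
```

On the real data this would be called as `rows_not_summing_to_100(data['Foreign %'], data['Domestic %'])`; any reported index points to a row worth inspecting by hand.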


### Check missing and duplicate data

In [23]:
def find_duplicates(data):
    """Report, for every pair of columns, the values that are identical within the same row.

    Comparison ignores case and surrounding whitespace; when both values are
    lists, they are compared regardless of element order.
    """
    columns = list(data.keys())
    num_columns = len(columns)

    duplicate_results = {}

    # Visit every unordered pair of columns exactly once.
    for i in range(num_columns):
        for j in range(i + 1, num_columns):
            col1 = columns[i]
            col2 = columns[j]

            col1_values = data[col1]
            col2_values = data[col2]

            duplicates = []

            # Compare the two columns row by row.
            for value1, value2 in zip(col1_values, col2_values):
                if isinstance(value1, list) and isinstance(value2, list):
                    if sorted([str(v).strip().lower() for v in value1]) == sorted([str(v).strip().lower() for v in value2]):
                        duplicates.append(value1)
                elif str(value1).strip().lower() == str(value2).strip().lower():
                    duplicates.append(value1)

            if duplicates:
                duplicate_results[(col1, col2)] = duplicates

    if not duplicate_results:
        print("No duplicated data found")
    else:
        for col_pair, duplicates in duplicate_results.items():
            print(f"Duplicates between {col_pair[0]} and {col_pair[1]}: {duplicates}")

    return duplicate_results

duplicate_results = find_duplicates(data)

Duplicates between Foreign % and Domestic %: ['50%']
Duplicates between Director and Writer: ['James Cameron', 'James Cameron', 'Brad Bird', 'George Lucas', 'George Lucas', 'Christopher Nolan', 'George Lucas', 'M. Night Shyamalan', 'Brad Bird', 'Christopher Nolan', 'Damien Chazelle', 'Luc Besson', 'Stephen Sommers', 'Quentin Tarantino', 'M. Night Shyamalan', 'Quentin Tarantino', 'Christopher Nolan', 'Yang Song, Chiyu Zhang', 'Quentin Tarantino', 'M. Night Shyamalan', 'Rian Johnson', 'Paul W.S. Anderson', 'Rawson Marshall Thurber', 'Sylvester Stallone', 'Paul W.S. Anderson', 'Stephen Sommers', 'Amy Heckerling', 'Neill Blomkamp', 'M. Night Shyamalan', 'Cameron Crowe', 'Nancy Meyers', 'J.J. Abrams', 'Stanley Tong', 'M. Night Shyamalan', 'Jordan Peele', 'Han Han', 'Jordan Peele', 'Simon Kinberg', 'Richard Curtis', 'M. Night Shyamalan', 'Paul W.S. Anderson', 'Paul Feig', 'Da-Mo Peng, Fei Yan', 'Edgar Wright', 'Michael Moore', 'Judd Apatow', 'Nancy Meyers', 'Parker Finn', 'David Ayer', 'Alej
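
`find_duplicates` compares pairs of columns within each row. A complementary check looks for rows that occur more than once, sketched here under the assumption that a film is identified by its `Title` and `Year`. The helper name `find_duplicate_rows` and the sample values are hypothetical.

```python
from collections import Counter


def find_duplicate_rows(data, key_columns):
    """Return the normalized key tuples that occur in more than one row."""
    # Build one key tuple per row from the chosen columns.
    keys = zip(*(data[col] for col in key_columns))
    # Normalize case and whitespace so near-identical entries match.
    normalized = [tuple(str(v).strip().lower() for v in key) for key in keys]
    counts = Counter(normalized)
    return [key for key, n in counts.items() if n > 1]


# Made-up sample: the first and third rows describe the same film.
sample = {
    'Title': ['Film A', 'Film B', 'film a '],
    'Year': [2001, 2005, 2001],
}
print(find_duplicate_rows(sample, ['Title', 'Year']))
```

An empty result means no two rows share the same key; which columns make up the key is a modelling choice, since remakes can share a title but not a year.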

### Check missing data