# `DATA EXPLORATION`

## **TOPIC: FILMS ANALYSIS**

`Group ID`: 17

`Group Members`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

### Read data

In [1]:
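data = {}

# Read the tab-separated file; 'utf-8-sig' drops a leading BOM if present.
# The 'with' block closes the file once reading is done.
with open('films_data.csv', 'r', encoding='utf-8-sig') as file:
    # The first line holds the column names.
    first_line = file.readline().strip().split('\t')
    for val in first_line:
        data[val] = []

    # Append each field to the list of its column.
    for line in file:
        line_vals = line.strip().split('\t')
        for i in range(len(line_vals)):
            data[first_line[i]].append(line_vals[i])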
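
The same column-wise structure could also be built with the standard `csv` module. The sketch below is only an illustration: the `read_columns` helper and the three-column sample are hypothetical, and `quoting=csv.QUOTE_NONE` is chosen on the assumption that quote characters in the file should be kept as literal text, as the manual `split('\t')` does.

```python
import csv
import io

# Hypothetical sample standing in for films_data.csv (tab-separated).
sample = "Rank\tTitle\tYear\n1\tFilm A\t2009\n2\tFilm B\t2019\n"

def read_columns(handle):
    """Read a tab-separated stream into a dict of column name -> list of strings."""
    # QUOTE_NONE keeps quote characters as plain text, like a manual split.
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns

columns = read_columns(io.StringIO(sample))
print(columns['Title'])
```

With a real file, the handle would come from `open('films_data.csv', encoding='utf-8-sig', newline='')` instead of `io.StringIO`.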

### The meaning of each row
Each row represents a specific movie, detailing information about its release, performance, genre, and key contributors (director, writer, and cast).

In [2]:
print('Number of rows:', len(data['Rank']))

Number of rows: 1000


### The meaning of each column
- `Rank`: The film's rank among the top lifetime grosses.
- `Title`: The film's name.
- `Foreign %`: The share of the foreign gross in the film's worldwide gross.
- `Domestic %`: The share of the domestic gross in the film's worldwide gross.
- `Year`: The year in which the film was first released.
- `Genre`: The genre(s) associated with the film.
- `Director`: The director(s) of the film.
- `Writer`: The writer(s) credited for the film.
- `Cast`: The main cast members of the film.

In [3]:
print('Number of columns:', len(data))

Number of columns: 9


### The datatype of each column

In [4]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')


Rank: <class 'str'>
Title: <class 'str'>
Foreign %: <class 'str'>
Domestic %: <class 'str'>
Year: <class 'str'>
Genre: <class 'str'>
Director: <class 'str'>
Writer: <class 'str'>
Cast: <class 'str'>


### Preprocessing data

- Convert Rank: Remove the thousands separator (`,`) from `Rank` and convert it to an integer.

- Convert Percentage Columns: Convert `Foreign %` and `Domestic %` to floats by removing the `%` symbol. A `Domestic %` value of `-` means the foreign gross accounts for 100% of the film's worldwide gross, so the domestic share is set to 0. Any `<` marker in a `Domestic %` value is removed before conversion.

- Standardize Year Data Type: Convert `Year` to an integer for easy numerical analysis.

- Split Genres: Split each `Genre` value into a list so that every genre can be analyzed individually.

- Director and Writer Parsing: Split `Director` and `Writer` into lists so that individual contributions can be analyzed.

- Cast Parsing: Similarly, split `Cast` into a list of actor names, which makes it easier to analyze actor appearances across films.

In [5]:
# Rank may contain a thousands separator, which is removed before conversion.
data['Rank'] = [int(rank.replace(',', '')) for rank in data['Rank']]

# Strip the trailing '%' and convert to float.
data['Foreign %'] = [float(value.rstrip('%')) for value in data['Foreign %']]

# '-' means no domestic gross, so the share is 0.0; a '<' marker is dropped.
data['Domestic %'] = [
    float(value.replace('<', '').rstrip('%')) if value != '-' else 0.0
    for value in data['Domestic %']
]

data['Year'] = [int(year) for year in data['Year']]

# Multi-valued columns are comma-separated strings; split them into lists.
data['Genre'] = [genres.split(', ') for genres in data['Genre']]

data['Director'] = [directors.split(', ') for directors in data['Director']]

data['Writer'] = [writers.split(', ') for writers in data['Writer']]

data['Cast'] = [actors.split(', ') for actors in data['Cast']]
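
The percentage rules can also be collected in one small helper, which makes the special cases easier to test on their own. This is a minimal sketch: `parse_percent` and its `missing` parameter are hypothetical names, and the sample strings are made up rather than taken from the dataset.

```python
def parse_percent(text, missing=0.0):
    """Convert strings such as '72.9%', '<0.1%' or '-' to a float."""
    cleaned = text.strip()
    if cleaned == '-':
        # '-' is treated as "no gross in this market".
        return missing
    # Drop a leading '<' marker and the trailing '%' before converting.
    return float(cleaned.lstrip('<').rstrip('%'))

# Made-up sample values covering the three cases.
parsed = [parse_percent(v) for v in ['72.9%', '<0.1%', '-']]
print(parsed)
```

Using one helper for both percentage columns would also keep a `-` in `Foreign %` from raising a `ValueError`, should such a value ever appear.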


### New datatype of each column

In [6]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')

Rank: <class 'int'>
Title: <class 'str'>
Foreign %: <class 'float'>
Domestic %: <class 'float'>
Year: <class 'int'>
Genre: <class 'list'>
Director: <class 'list'>
Writer: <class 'list'>
Cast: <class 'list'>
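
The check above looks only at the first element of each column, so a column with mixed types (for instance an integer among floats) would go unnoticed. A possible complement, sketched with a hypothetical `column_type_counts` helper and made-up sample data, counts every type that occurs in a column:

```python
from collections import Counter

def column_type_counts(data):
    """Map each column name to a Counter of the type names found in it."""
    return {
        col_name: Counter(type(value).__name__ for value in col_values)
        for col_name, col_values in data.items()
    }

# Made-up sample: 'Domestic %' mixes a float and an int.
sample = {'Year': [2009, 2019], 'Domestic %': [27.0, 0]}
counts = column_type_counts(sample)
for col_name, type_counts in counts.items():
    print(col_name, dict(type_counts))
```

A column is homogeneous when its counter holds exactly one key.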


### Check duplicate data

- `find_duplicates(data)` accepts `data`, a dictionary whose keys are column names and whose values are the lists of values in those columns.
- `columns`: A list of column names (the keys of `data`).
- `num_columns`: The number of columns in `data`.
- Two nested `for` loops iterate over every pair of columns `(col1, col2)` without repeating combinations.
- `col1` and `col2` are the names of the two columns being compared.
- The list of values for each of the two columns is then retrieved.
- `duplicates` is initialized to store the values that are identical in both columns of the same row.
- `zip()` iterates over the two columns row by row, yielding one pair of values per row.
- If `value1` and `value2` are both lists:
    - Convert the items in each list to lowercase, trim whitespace, and sort them.
    - If the sorted lists are identical, add `value1` to `duplicates`.
- If `value1` and `value2` are not both lists:
    - Compare the two values after converting them to lowercase strings and trimming whitespace.
    - If they match, add `value1` to `duplicates`.
- If a column pair has any duplicates, the pair and its list of duplicate values are saved in `duplicate_results`.
- After all pairs are compared, the function prints `No duplicated data found` if `duplicate_results` is empty; otherwise it displays each column pair together with its list of duplicates.
- The function returns `duplicate_results`, a dictionary that maps each column pair with duplicates to the duplicate values found between them.

In [7]:
def find_duplicates(data):
    columns = list(data.keys())  
    num_columns = len(columns)
    
    duplicate_results = {}
    
    for i in range(num_columns):
        for j in range(i + 1, num_columns):
            col1 = columns[i]
            col2 = columns[j]
            
            col1_values = data[col1]
            col2_values = data[col2]
            
            duplicates = []
            
            for value1, value2 in zip(col1_values, col2_values):
                if isinstance(value1, list) and isinstance(value2, list):
                    if sorted([str(v).strip().lower() for v in value1]) == sorted([str(v).strip().lower() for v in value2]):
                        duplicates.append(value1)
                elif str(value1).strip().lower() == str(value2).strip().lower():
                    duplicates.append(value1)
            
            if duplicates:
                duplicate_results[(col1, col2)] = duplicates
    
    if not duplicate_results:
        print("No duplicated data found")
    else:
        for col_pair, duplicates in duplicate_results.items():
            print(f"Duplicates between {col_pair[0]} and {col_pair[1]}: {duplicates}")

    return duplicate_results

duplicate_results = find_duplicates(data)

Duplicates between Foreign % and Domestic %: [50.0]
Duplicates between Director and Writer: [['James Cameron'], ['James Cameron'], ['Brad Bird'], ['George Lucas'], ['George Lucas'], ['Christopher Nolan'], ['George Lucas'], ['M. Night Shyamalan'], ['Brad Bird'], ['Christopher Nolan'], ['Damien Chazelle'], ['Luc Besson'], ['Lana Wachowski', 'Lilly Wachowski'], ['Stephen Sommers'], ['Quentin Tarantino'], ['M. Night Shyamalan'], ['Quentin Tarantino'], ['Christopher Nolan'], ['Yang Song', 'Chiyu Zhang'], ['Quentin Tarantino'], ['M. Night Shyamalan'], ['Rian Johnson'], ['Paul W.S. Anderson'], ['Rawson Marshall Thurber'], ['Sylvester Stallone'], ['Paul W.S. Anderson'], ['Stephen Sommers'], ['Amy Heckerling'], ['Neill Blomkamp'], ['M. Night Shyamalan'], ['Cameron Crowe'], ['Nancy Meyers'], ['J.J. Abrams'], ['Stanley Tong'], ['M. Night Shyamalan'], ['Jordan Peele'], ['Han Han'], ['Jordan Peele'], ['Simon Kinberg'], ['Richard Curtis'], ['M. Night Shyamalan'], ['Paul W.S. Anderson'], ['Paul Feig'
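
`find_duplicates` compares columns with each other within a row; it does not detect the same film appearing in more than one row. A complementary row-level check could be sketched as follows. The `find_duplicate_rows` helper, the choice of `Title` and `Year` as the identifying key, and the sample data are all assumptions made for illustration.

```python
def find_duplicate_rows(data, key_columns):
    """Return {key: [row indices]} for keys that occur in more than one row."""
    seen = {}
    rows = zip(*(data[col] for col in key_columns))
    for index, values in enumerate(rows):
        # Normalize the same way as find_duplicates: lowercase, trimmed strings.
        key = tuple(str(v).strip().lower() for v in values)
        seen.setdefault(key, []).append(index)
    return {key: indices for key, indices in seen.items() if len(indices) > 1}

# Made-up sample: rows 0 and 2 describe the same film.
sample = {
    'Title': ['Film A', 'Film B', 'film a '],
    'Year': [2009, 2019, 2009],
}
dupes = find_duplicate_rows(sample, ['Title', 'Year'])
print(dupes)
```

An empty result means that no two rows share the chosen key.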

### Check missing data

- `check_missing_data(data)` takes one argument, `data`: a dictionary whose keys are column names and whose values are the lists of values in those columns.
- `columns`: A list containing the names of the columns.
- `missing_data_results`: A dictionary that stores the columns with missing data.
- `col_values`: The list of values for the current column.
- `missing_indices`: A list that stores the indices of missing values.
- The `for` loop goes through each value in the column, using `enumerate` to get both the index and the value.
- A value counts as missing if:
    - The value is `None`.
    - The value is an empty string or a string containing only whitespace.
- If a missing value is found, its index is appended to `missing_indices`.
- If the current column has any missing indices, they are added to `missing_data_results` with the column name as the key and the list of missing indices as the value.
- If no missing data is found, the function prints `No missing data found`. Otherwise, it displays each column name and its missing indices.
- The function returns a dictionary with the column names as keys and the lists of indices with missing data as values.

In [8]:
def check_missing_data(data):
    columns = list(data.keys())  
    missing_data_results = {}
    
    for col in columns:
        col_values = data[col]
        missing_indices = []

        for index, value in enumerate(col_values):
            if value is None or (isinstance(value, str) and value.strip() == ''):
                missing_indices.append(index)
        
        if missing_indices:
            missing_data_results[col] = missing_indices

    if not missing_data_results:
        print("No missing data found")
    else:
        for col, indices in missing_data_results.items():
            print(f"Missing data in column '{col}' at indices: {indices}")

    return missing_data_results

missing_data_results = check_missing_data(data)


No missing data found
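
One caveat: after preprocessing, `Genre`, `Director`, `Writer`, and `Cast` hold lists, and splitting an empty string yields `['']`, which is neither `None` nor a string. The check above would therefore not flag an originally empty cell in those columns. A hedged sketch of a broader test follows; the `is_missing` helper is hypothetical and not part of the notebook.

```python
def is_missing(value):
    """Treat None, blank strings, and lists with no non-blank item as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ''
    if isinstance(value, list):
        # An empty list, or a list such as [''], carries no information.
        return all(is_missing(item) for item in value)
    return False

# An empty cell after splitting, next to a normal list and a number.
checks = [is_missing(''.split(', ')), is_missing(['Action']), is_missing(0.0)]
print(checks)
```

Replacing the condition inside `check_missing_data` with `is_missing(value)` would apply this rule to every column.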
