# `DATA EXPLORATION`

## **TOPIC: FILMS ANALYSIS**

`Group ID`: 17

`Group Members`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

In [None]:
import pandas as pd

### Read data

In [27]:
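# Read the tab-separated file into a dict of column name -> list of string values
data = {}
with open('films_data.csv', 'r', encoding='utf-8-sig') as file:
    # The first line holds the column names
    first_line = file.readline().strip().split('\t')
    for val in first_line:
        data[val] = []

    # Each remaining line is one film; append its values column by column
    for line in file:
        line_vals = line.strip().split('\t')
        for i in range(len(line_vals)):
            data[first_line[i]].append(line_vals[i])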
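
Since pandas is already imported, the same tab-separated layout could probably also be loaded with `pd.read_csv`. The sketch below is only an illustration under assumptions: it reads a made-up two-row sample from memory instead of `films_data.csv`, and the names `sample`, `sample_df` and `sample_data` are hypothetical. For the real file, `encoding='utf-8-sig'` would be passed as in the manual reader.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for films_data.csv (tab-separated).
sample = (
    "Rank\tTitle\tForeign %\tDomestic %\tYear\n"
    "1\tFilm A\t73.1%\t26.9%\t2009\n"
    "2\tFilm B\t100%\t-\t2019\n"
)

# sep='\t' mirrors the manual split('\t'); dtype=str keeps every value as text,
# and keep_default_na=False stops pandas from turning empty cells into NaN.
sample_df = pd.read_csv(
    io.StringIO(sample), sep='\t', dtype=str, keep_default_na=False
)

# Convert to the same dict-of-lists layout that the manual loop builds.
sample_data = sample_df.to_dict(orient='list')
print(sample_data['Title'])
```

Keeping everything as strings at load time leaves the type conversions to the preprocessing step, which matches how the notebook proceeds.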

### The meaning of each row
Each row represents a specific movie, detailing information about its release, performance, genre, and key contributors (director, writer, and cast).

In [28]:
print('Number of rows:', len(data['Rank']))

Number of rows: 1000


### The meaning of each column
- `Rank`: The film's rank by lifetime gross.
- `Title`: The film's name.
- `Foreign %`: The percentage of the film's worldwide gross that comes from foreign markets.
- `Domestic %`: The percentage of the film's worldwide gross that comes from the domestic market.
- `Year`: The year the film was first released.
- `Genre`: The genre(s) associated with the film.
- `Director`: The director(s) of the film.
- `Writer`: The writer(s) credited for the film.
- `Cast`: The main cast members of the film.

In [29]:
print('Number of columns:', len(data))

Number of columns: 9


### The datatype of each column

In [30]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')


Rank: <class 'str'>
Title: <class 'str'>
Foreign %: <class 'str'>
Domestic %: <class 'str'>
Year: <class 'str'>
Genre: <class 'str'>
Director: <class 'str'>
Writer: <class 'str'>
Cast: <class 'str'>


### Preprocessing data

- Convert Rank: Remove any thousands separators (`,`) from `Rank` and convert it to an integer.

- Convert Percentage Columns: Convert `Foreign %` and `Domestic %` to floats by removing the `%` symbol. A `Domestic %` value of `-` means that the foreign gross accounts for 100% of the film's worldwide gross, so the domestic share is set to 0%. A leading `<` in `Domestic %` is dropped before conversion.

- Standardize Year Data Type: Convert `Year` to an integer for easy numerical analysis.

- Split Genres: Split the comma-separated values in `Genre` into a list so that each genre can be analyzed individually.

- Director and Writer Parsing: Split multiple directors or writers into lists so that individual contributions can be analyzed.

- Cast Parsing: Similarly, split the `Cast` column into a list of actor names, which makes it easier to analyze actor appearances across movies.
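
The percentage rule can be tried on a few values in isolation. This is a minimal sketch, not the notebook's own code: `parse_percent` is a hypothetical helper, and the sample strings (including `'<0.1%'`) are assumed examples of the formats described above rather than values taken from the dataset.

```python
def parse_percent(value):
    """Convert strings such as '73.1%', '<0.1%' or '-' to a float."""
    value = value.strip()
    if value == '-':
        # '-' is treated as a 0% share, as described above.
        return 0.0
    # Drop a leading '<' and the trailing '%' before converting.
    # Note: '<0.1%' becomes 0.1, which is an upper bound, not an exact value.
    return float(value.lstrip('<').rstrip('%'))

# Assumed sample inputs covering each format.
samples = ['73.1%', '<0.1%', '-', '100%']
print([parse_percent(v) for v in samples])  # [73.1, 0.1, 0.0, 100.0]
```

Returning `0.0` rather than `0` keeps every element of the column a float.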

In [31]:
data['Rank'] = [int(rank.replace(',', '')) for rank in data['Rank']]

data['Foreign %'] = [float(value.rstrip('%')) for value in data['Foreign %']]

data['Domestic %'] = [
    float(value.replace('<', '').rstrip('%')) if value != '-' else 0.0
    for value in data['Domestic %']
]

data['Year'] = [int(year) for year in data['Year']]

data['Genre'] = [genres.split(', ') for genres in data['Genre']]

data['Director'] = [directors.split(', ') for directors in data['Director']]

data['Writer'] = [writers.split(', ') for writers in data['Writer']]


data['Cast'] = [actors.split(', ') for actors in data['Cast']]


### New datatype of each column

In [32]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')

Rank: <class 'int'>
Title: <class 'str'>
Foreign %: <class 'float'>
Domestic %: <class 'float'>
Year: <class 'int'>
Genre: <class 'list'>
Director: <class 'list'>
Writer: <class 'list'>
Cast: <class 'list'>


### Check duplicated data

- `normalize_data` makes row comparisons consistent by cleaning each value, so that formatting differences (letter case, surrounding spaces, list order) do not hide duplicates.
- The function processes each cell in a DataFrame row:
    - If the cell contains a list:
        - Each item is converted to a string, stripped of leading and trailing spaces, and lowercased.
        - The items are sorted with `sorted()` so that order doesn’t matter (`['b', 'a']` becomes `['a', 'b']`).
        - The cleaned list is converted to a string so that pandas can compare it (lists are unhashable, so `duplicated()` cannot process them directly).
    - If the cell does not contain a list, the value is converted to a string, stripped of extra spaces, and lowercased.
    - The cleaned row is then returned.
- `normalize_data` is then applied to each row of the DataFrame.
- The line `num_duplicated_rows = normalized_df.duplicated().sum()` counts the duplicates in the normalized DataFrame:
    - `normalized_df.duplicated()` returns a boolean Series in which a row is marked `True` if it duplicates an earlier row.
    - `.sum()` counts the `True` values in this Series, giving the total number of duplicated rows.
- If duplicates are found, a new DataFrame is created from the original rows that have a duplicate:
    - `keep=False` marks every row in a duplicate group as `True`, including the first occurrence, so that all duplicated rows can be seen.
    - The duplicated rows are then printed for examination.
- If no duplicates are found, the code prints "No duplicated data found."

In [None]:
# Create a DataFrame from the data
df = pd.DataFrame(data)

def normalize_data(row):
    # Normalize each cell by converting to lowercase, stripping spaces,
    # sorting lists if any, and converting lists to strings for comparison
    return row.apply(
        lambda x: str(sorted([str(v).strip().lower() for v in x])) if isinstance(x, list) 
        else str(x).strip().lower()
    )

# Apply normalization to each row
normalized_df = df.apply(normalize_data, axis=1)

num_duplicated_rows = normalized_df.duplicated().sum()
print(f"The raw data has {num_duplicated_rows} duplicated rows (excluding the first occurrence in each group).")

if num_duplicated_rows > 0:
    duplicates = df[normalized_df.duplicated(keep=False)]
    print("Duplicated rows:")
    print(duplicates)
else:
    print("No duplicated data found.")

The raw data has 0 duplicated rows (excluding the first occurrence in each group).
No duplicated data found.
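
Because the real dataset contains no duplicates, the effect of the normalization and of `keep=False` is easier to see on a toy DataFrame. The sketch below uses made-up rows; `toy` and `normalize_cell` are hypothetical names, and the normalization is applied column by column, which gives the same cell values as the row-wise version above.

```python
import pandas as pd

# Made-up rows: the first two differ only in case, spacing and list order.
toy = pd.DataFrame({
    'Title': ['Film A', ' film a ', 'Film B'],
    'Genre': [['Action', 'Drama'], ['drama', 'action'], ['Comedy']],
})

def normalize_cell(x):
    # Lists: clean each item, sort, and stringify so pandas can compare them.
    if isinstance(x, list):
        return str(sorted(str(v).strip().lower() for v in x))
    # Scalars: stringify, strip and lowercase.
    return str(x).strip().lower()

normalized = toy.apply(lambda col: col.map(normalize_cell))

# Default keep='first': only later copies are flagged.
later_copies = normalized.duplicated()
# keep=False: every member of a duplicate group is flagged.
all_copies = normalized.duplicated(keep=False)

print(later_copies.tolist())  # [False, True, False]
print(all_copies.tolist())    # [True, True, False]
```

This is why the count uses the default (`.sum()` gives the number of redundant rows) while the display uses `keep=False` (so each group can be inspected in full).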


### Check missing data

- `check_missing_data` takes one argument `data`, which is a dictionary where the key is the column name, and the value is a list of values for that column.
- `missing_data`: a dictionary that maps each column containing missing data to the list of indices of its missing values.
- `missing_ratio_results`: a dictionary to store the missing data ratio for each column.
- `missing_indices`: a list to store the indices of missing values in the current column.
- The `for` loop iterates over each value in the column, using `enumerate` to get both the index and the value.
- Missing data conditions:
  + The value is `None`.
  + The value is an empty string or a string with only whitespace.
- If a missing value is found, its index is appended to `missing_indices`.
- If any missing indices are found for the current column, they are added to `missing_data`, with the column name as the key and the list of missing indices as the value.
- Missing data ratio:
  + For each column, the ratio of missing data is calculated by dividing the number of missing values by the total number of values in the column.
  + This ratio is saved in `missing_ratio_results` as a floating-point value.
- If no missing data is found, the function prints `"No missing data found"`; otherwise, it prints each affected column name with the indices of its missing values. In both cases it then prints the missing ratio of every column and returns `missing_data` and `missing_ratio_results`.

In [None]:
def check_missing_data(data):
    columns = list(data.keys())
    missing_data = {}
    missing_ratio_results = {}

    for col in columns:
        col_values = data[col]
        missing_indices = []

        for index, value in enumerate(col_values):
            if value is None or (isinstance(value, str) and value.strip() == ''):
                missing_indices.append(index)
                
        if missing_indices:
            missing_data[col] = missing_indices
        missing_ratio = len(missing_indices) / len(col_values)
        missing_ratio_results[col] = missing_ratio

    if not missing_data:
        print("No missing data found")
    
    for col, indices in missing_data.items():
        print(f"Missing data in column '{col}' at indices: {indices}")
    print("\nMissing ratio per column:")
    
    for col, ratio in missing_ratio_results.items():
        print(f"Column '{col}' missing ratio: {ratio:.2%}") 
        
    return missing_data, missing_ratio_results

missing_data, missing_ratio_results = check_missing_data(data)

No missing data found

Missing ratio per column:
Column 'Rank' missing ratio: 0.00%
Column 'Title' missing ratio: 0.00%
Column 'Foreign %' missing ratio: 0.00%
Column 'Domestic %' missing ratio: 0.00%
Column 'Year' missing ratio: 0.00%
Column 'Genre' missing ratio: 0.00%
Column 'Director' missing ratio: 0.00%
Column 'Writer' missing ratio: 0.00%
Column 'Cast' missing ratio: 0.00%
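
One caveat: the check runs after preprocessing, and `''.split(', ')` returns `['']`, so an originally empty `Genre`, `Director`, `Writer` or `Cast` cell would be a one-item list rather than an empty string and would not match the conditions above. The sketch below shows one possible way to cover that case. It is an assumption about how the check could be extended, not part of the notebook: `is_missing` is a hypothetical helper and `toy_data` is made up.

```python
def is_missing(value):
    """Hypothetical helper: True for None, blank strings,
    and lists that contain no non-blank item."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ''
    if isinstance(value, list):
        # An empty list, or a list such as [''], counts as missing.
        return all(is_missing(item) for item in value)
    return False

# Made-up data in the same dict-of-lists layout as `data`.
toy_data = {
    'Title': ['Film A', '  ', 'Film C'],
    'Genre': [['Action'], [''], ['Drama', 'Comedy']],
    'Year': [2009, 2015, 2019],
}

# Missing ratio per column: number of missing values / number of values.
toy_ratios = {
    col: sum(is_missing(v) for v in values) / len(values)
    for col, values in toy_data.items()
}
print(toy_ratios)
```

Here `Title` and `Genre` each have one missing value out of three, while `Year` has none.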


- Because the missing ratio of every column is 0.00% and no duplicated rows were found, the dataset needs no imputation or deduplication and is ready for analysis.