# `DATA EXPLORATION`

## **TOPIC: FILMS ANALYSIS**

`Group ID`: 17

`Group Member`:
- 22127404_Tạ Minh Thư
- 22127359_Chu Thúy Quỳnh
- 22127302_Nguyễn Đăng Nhân

### Read data

In [1]:
data = {}
with open('films_data.csv', 'r', encoding='utf-8-sig') as file:
    # The first line holds the tab-separated column names
    first_line = file.readline().strip().split('\t')
    for val in first_line:
        data[val] = []

    # Append each field of every remaining line to its column
    for line in file:
        line_vals = line.strip().split('\t')
        for i in range(len(line_vals)):
            data[first_line[i]].append(line_vals[i])
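
The same column-wise dictionary can also be built with the standard `csv` module. This is a minimal sketch, assuming the file is tab-separated with a header row. `read_columns` is a hypothetical helper name, and the in-memory sample only stands in for `films_data.csv`:

```python
import csv
import io


def read_columns(file_obj):
    """Read tab-separated text into a dict mapping column name -> list of strings."""
    # QUOTE_NONE keeps quote characters as plain text, like a manual split('\t')
    reader = csv.reader(file_obj, delimiter='\t', quoting=csv.QUOTE_NONE)
    header = next(reader)                      # first row holds the column names
    columns = {name: [] for name in header}
    for row in reader:
        if not row:                            # skip completely blank lines
            continue
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# Made-up sample rows, used only to illustrate the helper
sample = io.StringIO("Rank\tTitle\tYear\n1\tFilm A\t2009\n2\tFilm B\t2019\n")
columns = read_columns(sample)
print(columns['Title'])
```

With a real file, the object passed in would come from `open('films_data.csv', newline='', encoding='utf-8-sig')`.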

### The meaning of each row
Each row represents a specific movie, detailing information about its release, performance, genre, and key contributors (director, writer, and cast).

In [2]:
print('Number of rows:', len(data['Rank']))

Number of rows: 1000
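
Counting `data['Rank']` gives the row count only if every column has the same length. The reading loop appends just the fields present on each line, so a short line would leave later columns shorter, and `zip` over the columns would then silently truncate to the shortest one. A minimal sketch of a consistency check follows; `check_equal_lengths` is a hypothetical helper, not part of the original notebook:

```python
def check_equal_lengths(columns):
    """Return the common column length, or raise ValueError if columns are ragged."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Columns have different lengths: {lengths}')
    # All columns agree, so any one of them gives the row count
    return next(iter(lengths.values()), 0)


# Made-up example columns
example = {'Rank': ['1', '2'], 'Title': ['Film A', 'Film B']}
print('Number of rows:', check_equal_lengths(example))
```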


### The meaning of each column
- `Rank`: The film's rank among the top lifetime grosses.
- `Title`: The film's name.
- `Foreign %`: The foreign gross as a percentage of the film's worldwide gross.
- `Domestic %`: The domestic gross as a percentage of the film's worldwide gross.
- `Year`: The year the film was first released.
- `Genre`: The genre(s) associated with the film.
- `Director`: The director(s) of the film.
- `Writer`: The writer(s) credited for the film.
- `Cast`: The main cast members of the film.

In [3]:
print('Number of columns:', len(data))

Number of columns: 9


### The datatype of each column

In [4]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')


Rank: <class 'str'>
Title: <class 'str'>
Foreign %: <class 'str'>
Domestic %: <class 'str'>
Year: <class 'str'>
Genre: <class 'str'>
Director: <class 'str'>
Writer: <class 'str'>
Cast: <class 'str'>
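
Checking only `col_value[0]` reports the type of the first element, so it would not reveal a column that mixes types. A minimal sketch that collects every type present in each column is shown below; `column_types` is a hypothetical helper and the example values are made up:

```python
def column_types(columns):
    """Map each column name to the sorted names of all value types it contains."""
    return {
        name: sorted({type(value).__name__ for value in values})
        for name, values in columns.items()
    }


# Made-up example: the second column mixes float and int
example = {'Rank': ['1', '2'], 'Domestic %': [45.3, 0]}
for name, types in column_types(example).items():
    print(f'{name}: {types}')
```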


### Preprocessing data

- Convert Rank: Remove the thousands separator (`,`) from `Rank` and convert it to an integer.

- Convert Percentage Columns: Convert `Foreign %` and `Domestic %` to floats by removing the '%' symbol (and the '<' prefix that some `Domestic %` values carry). A `Domestic %` value of '-' indicates that the foreign gross accounts for 100% of the film's worldwide gross, so the domestic share is treated as 0%.

- Standardize Year Data Type: Convert `Year` to an integer for easy numerical analysis.

- Split Genres: Split each `Genre` value into a list so that every genre can be analyzed individually.

- Director and Writer Parsing: Split multiple directors or writers into lists to analyze individual contributions.

- Cast Parsing: Similarly, split the `Cast` column into lists of actor names, which makes it easier to analyze actor appearances across movies.
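
The percentage rule can be sketched as one small function. This is a minimal illustration under the assumptions stated in the list: `parse_percent` is a hypothetical helper, the sample strings are made up, and mapping '-' to 0.0 is only appropriate for `Domestic %`:

```python
def parse_percent(text):
    """Convert a percentage string such as '45.3%' or '<0.1%' to a float.

    A lone '-' is treated as 0.0 (the domestic share when the gross is all foreign).
    """
    text = text.strip()
    if text == '-':
        return 0.0
    # Drop the '<' prefix and the trailing '%' before converting
    return float(text.replace('<', '').rstrip('%'))


# Made-up sample values
samples = ['45.3%', '<0.1%', '-', '100%']
print([parse_percent(s) for s in samples])
```

Returning `0.0` rather than `0` keeps every value in the column a float.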

In [5]:
# Rank: drop thousands separators, then convert to int
data['Rank'] = [int(rank.replace(',', '')) for rank in data['Rank']]

# Foreign %: strip the trailing '%' and convert to float
data['Foreign %'] = [float(value.rstrip('%')) for value in data['Foreign %']]

# Domestic %: '-' means no domestic gross (0%); otherwise drop '<' and '%'
data['Domestic %'] = [
    float(value.replace('<', '').rstrip('%')) if value != '-' else 0.0
    for value in data['Domestic %']
]

data['Year'] = [int(year) for year in data['Year']]

# Multi-valued columns: split the comma-separated strings into lists
data['Genre'] = [genres.split(', ') for genres in data['Genre']]

data['Director'] = [directors.split(', ') for directors in data['Director']]

data['Writer'] = [writers.split(', ') for writers in data['Writer']]

data['Cast'] = [actors.split(', ') for actors in data['Cast']]


### New datatype of each column

In [6]:
for col_name, col_value in data.items():
    print(f'{col_name}: {type(col_value[0])}')

Rank: <class 'int'>
Title: <class 'str'>
Foreign %: <class 'float'>
Domestic %: <class 'float'>
Year: <class 'int'>
Genre: <class 'list'>
Director: <class 'list'>
Writer: <class 'list'>
Cast: <class 'list'>


### Check duplicated data

- `find_duplicates` takes one argument `data`, a dictionary where each key is a column name and each value is the list of values for that column.
- `rows`: the column-wise data transposed into row-wise data. Each row becomes a tuple holding that row's value from every column.
- A nested loop compares every pair of rows:
    - `i`: index of the first row.
    - `j`: index of the second row, always greater than `i`, so each pair is compared only once.
- Normalization makes the comparison between rows consistent:
    - If an element is a list, each item is converted to a string, lowercased, and stripped of leading/trailing spaces, and the items are then sorted.
    - If an element is not a list, it is converted to a string, lowercased, and stripped of leading/trailing spaces.
- When two normalized rows are equal, the pair of row indices `(i, j)` is used as the key in `duplicate_results`, and the original data of row `i` is stored as the value.
- If no duplicates are found, the function prints `No duplicated data found`; otherwise it prints the indices of each duplicated pair together with the duplicated content.
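
Pairwise comparison needs on the order of n² comparisons for n rows. As an alternative, duplicates can be found in a single pass by grouping rows under a hashable normalized key. The sketch below swaps in that grouping technique and is not the notebook's own method; `normalize_row` and `find_duplicate_groups` are hypothetical names and the example data is made up:

```python
from collections import defaultdict


def normalize_row(row):
    """Lowercase and strip every value; sort list values so their order is ignored."""
    return tuple(
        tuple(sorted(str(v).strip().lower() for v in value)) if isinstance(value, list)
        else str(value).strip().lower()
        for value in row
    )


def find_duplicate_groups(data):
    """Group row indices by normalized content and keep groups with more than one row."""
    groups = defaultdict(list)
    for index, row in enumerate(zip(*data.values())):
        groups[normalize_row(row)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Made-up example: rows 0 and 2 differ only in case, spacing, and genre order
example = {
    'Title': ['Film A', 'Film B', ' film a '],
    'Genre': [['Action', 'Drama'], ['Comedy'], ['drama', 'action']],
}
print(find_duplicate_groups(example))
```

Each group lists all indices that share the same normalized content, so three identical rows appear as one group rather than three pairs.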

In [7]:
def find_duplicates(data):
    # Transpose the column-wise dictionary into a list of row tuples
    rows = list(zip(*data.values()))
    num_rows = len(rows)
    duplicate_results = {}

    def normalize(row):
        # Lowercase and strip every value; sort list values so item order is ignored
        return [
            sorted([str(v).strip().lower() for v in value]) if isinstance(value, list)
            else str(value).strip().lower()
            for value in row
        ]

    # Normalize each row once instead of once per comparison
    normalized = [normalize(row) for row in rows]

    # Compare every pair of rows (i < j)
    for i in range(num_rows):
        for j in range(i + 1, num_rows):
            if normalized[i] == normalized[j]:
                duplicate_results[(i, j)] = rows[i]

    if not duplicate_results:
        print("No duplicated data found")
    else:
        for row_pair, duplicate in duplicate_results.items():
            print(f"Duplicate rows between indices {row_pair[0]} and {row_pair[1]}: {duplicate}")

    return duplicate_results


duplicate_results = find_duplicates(data)

No duplicated data found


### Check missing data

- `check_missing_data` takes one argument `data`, a dictionary where each key is a column name and each value is the list of values for that column.
- `missing_data`: a dictionary that maps each column containing missing data to the indices of its missing values.
- `missing_ratio_results`: a dictionary that stores the missing data ratio of every column.
- `missing_indices`: a list that collects the indices of missing values in the current column.
- The `for` loop iterates over each value in the column, using `enumerate` to get both the index and the value.
- Missing data conditions:
  + The value is `None`.
  + The value is an empty string or a string with only whitespace.
- If a missing value is found, its index is appended to `missing_indices`.
- If any missing indices are found for the current column, they are added to `missing_data`, with the column name as the key and the list of missing indices as the value.
- Missing data ratio:
  + For each column, the ratio is the number of missing values divided by the total number of values in the column.
  + This ratio is saved in `missing_ratio_results` as a floating-point value.
- If no missing data is found, the function prints `"No missing data found"`. Otherwise, it prints each affected column name with the indices of its missing values. In both cases it then prints the missing ratio of every column.
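
The two conditions above cover `None` and blank strings. After preprocessing, `Genre`, `Director`, `Writer`, and `Cast` hold lists, and splitting an empty string produces `['']`, which neither condition matches. A minimal sketch of a predicate that also covers lists is shown below. It is an assumption about how the check could be extended, not part of the original function; `is_missing` is a hypothetical name and the sample values are made up:

```python
def is_missing(value):
    """Return True for None, blank strings, empty lists, and lists of only blank items."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ''
    if isinstance(value, list):
        # ''.split(', ') yields [''], so a blank cell becomes a one-item list
        return all(is_missing(item) for item in value)
    return False


# Made-up sample values
samples = [None, '  ', [''], [], ['Action'], 0, 'Drama']
print([is_missing(v) for v in samples])
```

Numeric values such as `0` are never treated as missing, so a 0% domestic share still counts as present.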

In [8]:
def check_missing_data(data):
    columns = list(data.keys())
    missing_data = {}           # column name -> indices of missing values
    missing_ratio_results = {}  # column name -> fraction of missing values

    for col in columns:
        col_values = data[col]
        missing_indices = []

        # A value is missing if it is None or a blank/whitespace-only string
        for index, value in enumerate(col_values):
            if value is None or (isinstance(value, str) and value.strip() == ''):
                missing_indices.append(index)

        if missing_indices:
            missing_data[col] = missing_indices
        missing_ratio = len(missing_indices) / len(col_values)
        missing_ratio_results[col] = missing_ratio

    if not missing_data:
        print("No missing data found")

    for col, indices in missing_data.items():
        print(f"Missing data in column '{col}' at indices: {indices}")

    print("\nMissing ratio per column:")
    for col, ratio in missing_ratio_results.items():
        print(f"Column '{col}' missing ratio: {ratio:.2%}")

    return missing_data, missing_ratio_results


missing_data, missing_ratio_results = check_missing_data(data)

No missing data found

Missing ratio per column:
Column 'Rank' missing ratio: 0.00%
Column 'Title' missing ratio: 0.00%
Column 'Foreign %' missing ratio: 0.00%
Column 'Domestic %' missing ratio: 0.00%
Column 'Year' missing ratio: 0.00%
Column 'Genre' missing ratio: 0.00%
Column 'Director' missing ratio: 0.00%
Column 'Writer' missing ratio: 0.00%
Column 'Cast' missing ratio: 0.00%


- The missing ratio of every column is 0.00% and no duplicated rows were found, so the dataset is ready for analysis without imputation or removal of duplicates.