
saraasadi78/Data-Mining


Data-Mining

Data mining on a dataset of movies produced by The Walt Disney Company, using Python.


Introduction

In this paper, we share the results of our data mining on the Walt Disney movies dataset.


Contributors

| Name | Email | GitHub username |
|---|---|---|
| Sara Asadi | saraasadi7899@gmail.com | saraasadi78 |
| Vahid Ramezani | vahid.ramezani.2014@gmail.com | ConnorLynch2000 |



Attribute Properties


| Name | Type | Range | Mean | Median | Mode | Min | Max |
|---|---|---|---|---|---|---|---|
| Title | Nominal | -- | -- | -- | -- | -- | -- |
| Production Company | Nominal | -- | -- | -- | -- | -- | -- |
| Release Date | Interval | [-,now] | -- | -- | -- | -- | -- |
| Running Time | Ratio | [40,167] | 97.8136 | 96 | 100 | 40.0 | 167 |
| Country | Nominal | -- | -- | -- | -- | -- | -- |
| Language | Nominal | -- | -- | -- | -- | -- | -- |
| Box Office | Ratio | [7.7,1.657b] | 165932061.3922 | 44900000.0 | 4000000 | 7.7 | 1657000000 |
| Budget | Ratio | [300k,410.6m] | 63877468.3098 | 30000000.0 | [5.0e+06 1.5e+08] | 300000 | 410600000.0 |
| Directed By | Nominal | -- | -- | -- | -- | -- | -- |
| Written by | Nominal | -- | -- | -- | -- | -- | -- |
| Based on | Nominal | -- | -- | -- | -- | -- | -- |
| Produced by | Nominal | -- | -- | -- | -- | -- | -- |
| Starring | Nominal | -- | -- | -- | -- | -- | -- |
| Music by | Nominal | -- | -- | -- | -- | -- | -- |
| Distributed by | Nominal | -- | -- | -- | -- | -- | -- |
| Story by | Nominal | -- | -- | -- | -- | -- | -- |
| Narrated by | Nominal | -- | -- | -- | -- | -- | -- |
| Cinematography | Nominal | -- | -- | -- | -- | -- | -- |
| Edited by | Nominal | -- | -- | -- | -- | -- | -- |
| Screenplay by | Nominal | -- | -- | -- | -- | -- | -- |
| Production Company | Nominal | -- | -- | -- | -- | -- | -- |
| Color Process | Nominal | -- | -- | -- | -- | -- | -- |
| Hepburn | Nominal | -- | -- | -- | -- | -- | -- |
| Adaption by | Nominal | -- | -- | -- | -- | -- | -- |
| Animation by | Nominal | -- | -- | -- | -- | -- | -- |

Data Skewness


The skewness computed for different attributes can be seen here.

Results are as follows:

Box Office

Box Office Skewness

Budget

Budget Skewness

Running Time

Running Time
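
As a rough illustration of what these skewness figures measure, the sketch below computes the population skewness (third central moment divided by the cubed standard deviation) using only the Python standard library. The function name `skewness` and the sample values are illustrative assumptions and are not taken from the repository's own code, which may use a different estimator.

```python
from statistics import fmean


def skewness(values):
    """Population skewness: third central moment divided by std**3."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: no asymmetry to measure
    return m3 / m2 ** 1.5


# Hypothetical running times in minutes; one long film stretches the right tail.
running_times = [80, 85, 90, 95, 100, 167]
print(skewness(running_times))  # a positive value indicates right skew
```

A positive result means the tail extends toward large values, which is what one would expect for attributes such as Box Office and Budget, whose means in the table above are far larger than their medians.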


Data Correctness and Completeness


The correctness and completeness of the data have been analysed through manual inspection and statistical measures. The data cleaning steps have been taken based on the results of this analysis.

You can see below the visual results of the analysis.

incompleteness
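
One simple way to quantify incompleteness is the share of missing cells per attribute. The sketch below assumes records are held as dictionaries and that `None` or an empty string marks a missing value; the helper name `missing_ratio` and the sample rows are hypothetical.

```python
def missing_ratio(rows, columns):
    """Return the fraction of missing (None or empty-string) cells per column."""
    total = len(rows)
    ratios = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        ratios[col] = missing / total if total else 0.0
    return ratios


# Hypothetical records; real rows would come from the movies dataset.
movies = [
    {"Title": "A", "Budget": 1_000_000, "Narrated by": None},
    {"Title": "B", "Budget": None, "Narrated by": None},
    {"Title": "C", "Budget": 3_000_000, "Narrated by": "Someone"},
    {"Title": "D", "Budget": 5_000_000, "Narrated by": ""},
]
ratios = missing_ratio(movies, ["Title", "Budget", "Narrated by"])
print(ratios)
```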


Outlier Detection


Using box plots, the valid range of values for each attribute has been computed, and data records falling above the upper limit or below the lower limit of that range have been removed.

Box plots are shown below.

box office box plot

budget box plot

running time box plot
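
The source does not state how the box-plot limits were derived. A common convention, assumed here, places the limits 1.5 times the interquartile range (IQR) beyond the first and third quartiles. The sketch below applies that rule with the standard library; the function names, the factor `k`, and the sample values are illustrative.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the limits."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical running times in minutes; 300 is an obvious outlier.
running_times = [80, 85, 90, 95, 100, 105, 110, 300]
print(iqr_bounds(running_times))
print(remove_outliers(running_times))
```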


Histogram


Below you can see histograms of some of the data attributes. You can find the computation code here.

Box Office

Box Office Histogram

Budget

Budget Histogram

Running Time

Running Time Histogram
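
For readers who want to see what a histogram computes before plotting it, the following sketch counts values in equal-width bins and prints a text rendering. It is not the repository's plotting code; the function name `histogram`, the bin count, and the sample values are assumptions.

```python
def histogram(values, bins=5):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # avoid zero width for constant data
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # max value goes in last bin
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return counts, edges


# Hypothetical running times in minutes.
running_times = [40, 60, 90, 95, 96, 100, 100, 120, 167]
counts, edges = histogram(running_times, bins=4)
for count, left, right in zip(counts, edges, edges[1:]):
    print(f"{left:6.1f} to {right:6.1f} | {'#' * count}")
```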


Dissimilarity Matrix


After removing unnecessary attributes, the dissimilarity matrix has been computed. You can follow the computation steps in dissimilarity_matrix.ipynb.
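
The notebook's exact method is not reproduced here. As a minimal sketch of one standard approach for mixed attribute types, the code below uses simple matching for nominal attributes (0 if equal, 1 otherwise) and range-scaled absolute difference for numeric ones, then averages the per-attribute results. The function name, the column choices, and the sample rows are assumptions.

```python
def dissimilarity_matrix(rows, nominal, numeric):
    """Pairwise dissimilarity in [0, 1], averaged over mixed-type attributes."""
    # Range of each numeric attribute, used to scale absolute differences.
    ranges = {}
    for col in numeric:
        col_values = [row[col] for row in rows]
        ranges[col] = max(col_values) - min(col_values)

    def distance(a, b):
        parts = []
        for col in nominal:
            parts.append(0.0 if a[col] == b[col] else 1.0)  # simple matching
        for col in numeric:
            span = ranges[col]
            parts.append(abs(a[col] - b[col]) / span if span else 0.0)
        return sum(parts) / len(parts)

    return [[distance(a, b) for b in rows] for a in rows]


# Hypothetical records with two nominal attributes and one numeric attribute.
movies = [
    {"Country": "US", "Language": "English", "Running Time": 80},
    {"Country": "US", "Language": "English", "Running Time": 100},
    {"Country": "UK", "Language": "English", "Running Time": 120},
]
matrix = dissimilarity_matrix(movies, ["Country", "Language"], ["Running Time"])
for row in matrix:
    print([round(d, 3) for d in row])
```

The resulting matrix is symmetric with zeros on the diagonal, so only the lower triangle needs to be stored or displayed.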


Correlation


Correlation between the attributes of the dataset has been computed after some cleaning and normalization steps. Some unnecessary attributes have been ignored and some nominal attributes have been converted to numerical values. The procedure and the final results can be found here.

correlation heatmap
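
The two steps described in this section, converting nominal values to numbers and computing correlation, can be sketched as follows. The source does not say which encoding or coefficient was used, so integer label encoding and the Pearson coefficient are assumptions, as are the function names and sample values.

```python
from math import sqrt


def label_encode(values):
    """Map each distinct nominal value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either column is constant
    return cov / sqrt(var_x * var_y)


# Hypothetical values (in millions) with an exactly linear relationship.
budgets = [10, 20, 30, 40]
box_office = [25, 45, 65, 85]
print(pearson(budgets, box_office))

# Hypothetical nominal column converted to integer codes.
print(label_encode(["Disney", "Pixar", "Disney", "Touchstone"]))
```

Integer codes impose an arbitrary order on nominal values, so correlations involving encoded attributes should be read with caution.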


Scatter Plot


Scatter plots for correlated attributes are shown below.

company-director scatter (four plots)


Data Cleaning


Cleaning reduced the dataset from 35 attributes (columns) to only 17.

Columns containing more than 40% missing or invalid values have been dropped.

Missing values in the remaining columns have been imputed automatically using each attribute's median.

You can see the procedure and the results by executing the cleaningDataset.py file.
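
The two rules above (drop columns with more than 40% missing values, then impute the median) can be sketched as follows. This is not the content of cleaningDataset.py; the function name `clean`, the use of `None` as the missing marker, and the sample rows are assumptions. Median imputation is applied only to numeric columns here, since a median is not defined for unordered nominal values.

```python
from statistics import median


def clean(rows, threshold=0.4):
    """Drop columns whose missing share exceeds `threshold`, then impute medians."""
    columns = list(rows[0].keys())
    kept = []
    for col in columns:
        missing = sum(1 for row in rows if row[col] is None)
        if missing / len(rows) <= threshold:
            kept.append(col)

    # Copy the kept columns so the input rows are left untouched.
    cleaned = [{col: row[col] for col in kept} for row in rows]
    for col in kept:
        present = [row[col] for row in cleaned if row[col] is not None]
        # Median imputation only makes sense for numeric columns.
        if present and all(isinstance(v, (int, float)) for v in present):
            fill = median(present)
            for row in cleaned:
                if row[col] is None:
                    row[col] = fill
    return cleaned


# Hypothetical records: "Narrated by" is 60% missing, "Budget" is 20% missing.
movies = [
    {"Title": "A", "Budget": 10, "Narrated by": None},
    {"Title": "B", "Budget": 20, "Narrated by": None},
    {"Title": "C", "Budget": None, "Narrated by": "Someone"},
    {"Title": "D", "Budget": 30, "Narrated by": None},
    {"Title": "E", "Budget": 40, "Narrated by": "Someone else"},
]
cleaned = clean(movies)
print(cleaned)
```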


Redundant Data


Redundant data records have been handled or removed during the data cleaning process.
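
A minimal sketch of duplicate removal, assuming that two records are redundant only when every attribute value matches exactly. The function name and sample rows are hypothetical, and the repository may apply a looser notion of redundancy.

```python
def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent, hashable fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical records; the third repeats the first.
movies = [
    {"Title": "A", "Running Time": 90},
    {"Title": "B", "Running Time": 100},
    {"Title": "A", "Running Time": 90},
]
print(drop_duplicates(movies))
```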


Dimensionality Reduction


Some useful observations can be made by inspecting the dataset. Some attributes carry different labels but contain the same data values, either literally or in meaning. Other attributes have more than 40% missing or invalid data.

All these cases have been addressed while cleaning the dataset.
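
Attributes that hold the same values under different labels can be found mechanically when the match is literal. The sketch below reports every pair of columns whose values agree in all rows; the function name and the column names `Studio` and `Production Company` are hypothetical, and attributes that match only in meaning would still need manual review.

```python
def identical_columns(rows):
    """Return pairs of column names whose values match in every row."""
    columns = list(rows[0].keys())
    pairs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if all(row[first] == row[second] for row in rows):
                pairs.append((first, second))
    return pairs


# Hypothetical records in which two differently labelled columns carry the same data.
movies = [
    {"Title": "A", "Studio": "Disney", "Production Company": "Disney"},
    {"Title": "B", "Studio": "Pixar", "Production Company": "Pixar"},
]
print(identical_columns(movies))
```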


Normalization


The data have been normalized using min-max scaling, which maps values into the range [0,1]. Alongside normalization, z-score standardization has also been applied to the dataset. You can compare an attribute before and after the procedure below.

original and normalized scatter

normalized scatter plot

The results are accessible at min-max_normalization.py.
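
For reference, the two transformations named in this section can be sketched with the standard library as follows. This is not the content of min-max_normalization.py; the function names and sample values are assumptions, and the z-score here uses the population standard deviation.

```python
from statistics import fmean, pstdev


def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


# Hypothetical budgets in dollars.
budgets = [300_000, 30_000_000, 150_000_000, 410_600_000]
print(min_max(budgets))
print(z_score(budgets))
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values, while z-scores express each value as a number of standard deviations from the mean.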


Numerosity Reduction


Given what the records in the dataset represent, it would be wrong to perform numerosity reduction on the data. Each record contains data about a single movie produced by Disney Pictures, and records are not normally related to one another. However, redundancy is not acceptable, and duplicate data records must be removed from the dataset. Outlier records have also been detected and removed. These problems have been solved during the data cleaning and outlier detection processes.
