Data mining on a dataset of movies produced by the Walt Disney Company, using Python.
In this paper, we share the results of our data mining on the Walt Disney movies dataset.
Name | Email | GitHub name |
---|---|---|
Sara Asadi | saraasadi7899@gmail.com | saraasadi78 |
Vahid Ramezani | vahid.ramezani.2014@gmail.com | ConnorLynch2000 |
Name | Type | Range | Mean | Median | Mode | Min | Max |
---|---|---|---|---|---|---|---|
Title | Nominal | -- | -- | -- | -- | -- | -- |
Production Company | Nominal | -- | -- | -- | -- | -- | -- |
Release Date | Interval | [-,now] | -- | -- | -- | -- | -- |
Running Time | Ratio | [40,167] | 97.8136 | 96 | 100 | 40.0 | 167 |
Country | Nominal | -- | -- | -- | -- | -- | -- |
Language | Nominal | -- | -- | -- | -- | -- | -- |
Box Office | Ratio | [7.7,1.657b] | 165932061.3922 | 44900000.0 | 4000000 | 7.7 | 1657000000 |
Budget | Ratio | [300k,410.6m] | 63877468.3098 | 30000000.0 | 5000000 and 150000000 (bimodal) | 300000 | 410600000.0 |
Directed By | Nominal | -- | -- | -- | -- | -- | -- |
Written by | Nominal | -- | -- | -- | -- | -- | -- |
Based on | Nominal | -- | -- | -- | -- | -- | -- |
Produced by | Nominal | -- | -- | -- | -- | -- | -- |
Starring | Nominal | -- | -- | -- | -- | -- | -- |
Music by | Nominal | -- | -- | -- | -- | -- | -- |
Distributed by | Nominal | -- | -- | -- | -- | -- | -- |
Story by | Nominal | -- | -- | -- | -- | -- | -- |
Narrated by | Nominal | -- | -- | -- | -- | -- | -- |
Cinematography | Nominal | -- | -- | -- | -- | -- | -- |
Edited by | Nominal | -- | -- | -- | -- | -- | -- |
Screenplay by | Nominal | -- | -- | -- | -- | -- | -- |
Production Company | Nominal | -- | -- | -- | -- | -- | -- |
Color Process | Nominal | -- | -- | -- | -- | -- | -- |
Hepburn | Nominal | -- | -- | -- | -- | -- | -- |
Adaptation by | Nominal | -- | -- | -- | -- | -- | -- |
Animation by | Nominal | -- | -- | -- | -- | -- | -- |
The skewness computed for the different attributes can be observed here.
Results are as follows:
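As a minimal sketch of this step (with made-up numbers standing in for the real dataset), skewness per numeric column can be computed with pandas:

```python
import pandas as pd

# Illustrative values only; the project computes this on the real
# Disney dataset. Column names mirror two attributes from the table.
df = pd.DataFrame({
    "Running Time": [40, 96, 97, 100, 100, 167],
    "Budget": [3e5, 5e6, 3e7, 3e7, 1.5e8, 4.106e8],
})

# pandas computes the adjusted Fisher-Pearson skewness per column.
skewness = df.skew(numeric_only=True)
print(skewness)
```

A positive value indicates a right-skewed attribute (a long tail of large values), which is typical for monetary columns such as Budget.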
Using manual inspection as well as statistical measures, the correctness and completeness of the data have been analysed. Data cleaning steps have been taken based on the results of this analysis.
You can see the visual results of the analysis below.
Using the box plot tool, the valid range of values has been computed, and records falling outside the whiskers (more than 1.5 × IQR above the upper quartile or below the lower quartile) have been removed.
Box plots are shown below.
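A sketch of that filtering rule, on made-up running times rather than the real data:

```python
import pandas as pd

# Illustrative column; the project applies this to the real dataset.
df = pd.DataFrame({"Running Time": [40, 90, 95, 96, 100, 105, 167, 400]})

q1 = df["Running Time"].quantile(0.25)
q3 = df["Running Time"].quantile(0.75)
iqr = q3 - q1

# Keep rows within the box-plot whiskers: [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
mask = df["Running Time"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[mask]
print(cleaned)
```

The obviously impossible value 400 falls outside the upper whisker and is dropped.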
Below you can see histogram of some of the data attributes. You can find the computation code here.
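The binning behind such a histogram can be sketched as follows (synthetic values; the project bins the real attribute and plots it):

```python
import numpy as np

# Made-up running times standing in for the real attribute values.
running_time = np.array([40, 75, 88, 90, 95, 96, 100, 100, 110, 167])

# Bin the values into 5 equal-width bins; a plotting library such as
# matplotlib would draw one bar per bin count.
counts, bin_edges = np.histogram(running_time, bins=5)
print(counts, bin_edges)
```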
After removing unnecessary attributes, the dissimilarity matrix has been computed. You can see and follow the computation steps [here](dissimilarity_matrix.ipynb).
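For nominal attributes, one common dissimilarity measure is simple matching: the fraction of attributes on which two records disagree. A minimal sketch on toy data (the column names are illustrative stand-ins, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

# Toy records with two nominal attributes.
df = pd.DataFrame({
    "Language": ["English", "English", "French"],
    "Country": ["US", "US", "France"],
})

n = len(df)
dissim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Proportion of attributes on which records i and j disagree.
        dissim[i, j] = (df.iloc[i] != df.iloc[j]).mean()
print(dissim)
```

The matrix is symmetric with a zero diagonal; identical records get dissimilarity 0, fully differing records get 1.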
The correlation between the dataset's attributes has been computed after some cleaning and normalization steps. Some unnecessary attributes have been ignored, and some nominal attributes have been converted to numerical values. The procedure and the final results can be found here.
Scatter plots for correlated attributes are shown below.
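A sketch of that encode-then-correlate step, on invented numbers rather than the real attribute values:

```python
import pandas as pd

# Illustrative columns; the project runs this on the real dataset.
df = pd.DataFrame({
    "Budget": [3e5, 5e6, 3e7, 1.5e8],
    "Box Office": [1e4, 1e7, 9e7, 5e8],
    "Language": ["English", "English", "French", "English"],
})

# Convert a nominal attribute to integer category codes so it can
# participate in the correlation matrix.
df["Language"] = df["Language"].astype("category").cat.codes

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(numeric_only=True)
print(corr)
```

Strongly correlated pairs (e.g. budget and box office in this toy example) are natural candidates for the scatter plots.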
After cleaning the dataset, which originally had 35 attributes (columns), we arrived at a dataset with only 17 attributes (columns).
Columns containing more than 40% missing or invalid values have been dropped.
The remaining columns' missing values have been imputed automatically using each attribute's median.
You can see the procedure and the results by executing the cleaningDataset.py file.
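The two cleaning rules above can be sketched as follows, on toy data (column names and values are illustrative, not the script's actual contents):

```python
import numpy as np
import pandas as pd

# Toy frame: one mostly complete column, one that is 80% missing.
df = pd.DataFrame({
    "Running Time": [96, np.nan, 100, 90, 110],
    "Narrated by": [np.nan, np.nan, np.nan, "X", np.nan],
})

# Rule 1: keep only columns whose missing ratio is at most 40%.
keep = df.columns[df.isna().mean() <= 0.40]
df = df[keep]

# Rule 2: impute remaining numeric gaps with each column's median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
print(df)
```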
Redundant data records have been handled or removed during the data cleaning process.
Some useful notes emerge from regularly inspecting the dataset. There are attributes with different labels that contain the same data values, whether identically or in substance. There are also attributes with more than 40% missing or invalid data.
All these cases have been addressed while cleaning the dataset.
The data have been normalized using min-max scaling, which maps values into the range [0,1]. Alongside normalization, z-score standardization has also been performed on the dataset. You can see and compare an attribute before and after the procedure below.
The results are accessible in min-max_normalization.py.
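Both scalings can be sketched on a single attribute (made-up values; the script applies this to the real columns):

```python
import pandas as pd

# Illustrative running times standing in for a real attribute.
s = pd.Series([40, 96, 100, 167], name="Running Time")

# Min-max scaling: maps the minimum to 0 and the maximum to 1.
min_max = (s - s.min()) / (s.max() - s.min())

# Z-score standardization: zero mean, unit (sample) standard deviation.
z_score = (s - s.mean()) / s.std()

print(min_max.tolist())
print(z_score.tolist())
```

Min-max preserves the shape of the distribution within a fixed range, while z-scores express each value as a number of standard deviations from the mean.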
Given the nature of the records in this dataset, it would simply be wrong to perform numerosity reduction on the data. Each record contains data about a single movie produced by Disney Pictures, and records are not normally related to one another. However, redundancy is not acceptable, and duplicate data records must be removed from the dataset. Outlier records have also been detected and removed. These problems have been solved during the data cleaning and outlier detection processes.
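The duplicate-removal part can be sketched as follows (toy records; titles and years are invented for illustration):

```python
import pandas as pd

# Toy frame with one exact duplicate record.
df = pd.DataFrame({
    "Title": ["Fantasia", "Fantasia", "Bambi"],
    "Release Year": [1940, 1940, 1942],
})

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()
print(deduped)
```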