Data mining on a dataset of movies produced by the Walt Disney Company, using Python.
In this paper, we share the results of our data mining on the Walt Disney movies dataset.
Name | Email | GitHub name |
---|---|---|
Sara Asadi | saraasadi7899@gmail.com | saraasadi78 |
Vahid Ramezani | vahid.ramezani.2014@gmail.com | ConnorLynch2000 |
Name | Type | Range | Mean | Median | Mode | Min | Max |
---|---|---|---|---|---|---|---|
Title | Nominal | -- | -- | -- | -- | -- | -- |
Production Company | Nominal | -- | -- | -- | -- | -- | -- |
Release Date | Interval | [-,now] | -- | -- | -- | -- | -- |
Running Time | Ratio | [40,167] | 97.8136 | 96 | 100 | 40.0 | 167 |
Country | Nominal | -- | -- | -- | -- | -- | -- |
Language | Nominal | -- | -- | -- | -- | -- | -- |
Box Office | Ratio | [7.7,1.657b] | 165932061.3922 | 44900000.0 | 4000000 | 7.7 | 1657000000 |
Budget | Ratio | [300k,410.6m] | 63877468.3098 | 30000000.0 | 5000000 and 150000000 (bimodal) | 300000 | 410600000.0 |
Directed By | Nominal | -- | -- | -- | -- | -- | -- |
Written by | Nominal | -- | -- | -- | -- | -- | -- |
Based on | Nominal | -- | -- | -- | -- | -- | -- |
Produced by | Nominal | -- | -- | -- | -- | -- | -- |
Starring | Nominal | -- | -- | -- | -- | -- | -- |
Music by | Nominal | -- | -- | -- | -- | -- | -- |
Distributed by | Nominal | -- | -- | -- | -- | -- | -- |
Story by | Nominal | -- | -- | -- | -- | -- | -- |
Narrated by | Nominal | -- | -- | -- | -- | -- | -- |
Cinematography | Nominal | -- | -- | -- | -- | -- | -- |
Edited by | Nominal | -- | -- | -- | -- | -- | -- |
Screenplay by | Nominal | -- | -- | -- | -- | -- | -- |
Production Company | Nominal | -- | -- | -- | -- | -- | -- |
Color Process | Nominal | -- | -- | -- | -- | -- | -- |
Hepburn | Nominal | -- | -- | -- | -- | -- | -- |
Adaptation by | Nominal | -- | -- | -- | -- | -- | -- |
Animation by | Nominal | -- | -- | -- | -- | -- | -- |
The skewness computed for the different attributes can be observed here.
Results are as follows:
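As a minimal sketch of this step (with made-up numbers standing in for the real dataset), skewness per numeric column can be computed with pandas:

```python
import pandas as pd

# Illustrative values only; the project computes this on the real
# Disney dataset. Column names mirror two attributes from the table.
df = pd.DataFrame({
    "Running Time": [40, 96, 97, 100, 100, 167],
    "Budget": [3e5, 5e6, 3e7, 3e7, 1.5e8, 4.106e8],
})

# pandas computes the adjusted Fisher-Pearson skewness per column.
skewness = df.skew(numeric_only=True)
print(skewness)
```

A positive value indicates a right-skewed attribute (a long tail of large values), which is typical for monetary columns such as Budget.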
Using manual inspection as well as statistical measures, the correctness and completeness of the data have been analysed. Data cleaning steps have been taken based on the results of this analysis.
You can see the visual results of the analysis below.
Using the box plot tool, the valid range of values has been computed, and records falling outside the whiskers (more than 1.5 × IQR above the upper quartile or below the lower quartile) have been removed.
Box plots are shown below.
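A sketch of that filtering rule, on made-up running times rather than the real data:

```python
import pandas as pd

# Illustrative column; the project applies this to the real dataset.
df = pd.DataFrame({"Running Time": [40, 90, 95, 96, 100, 105, 167, 400]})

q1 = df["Running Time"].quantile(0.25)
q3 = df["Running Time"].quantile(0.75)
iqr = q3 - q1

# Keep rows within the box-plot whiskers: [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
mask = df["Running Time"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[mask]
print(cleaned)
```

The obviously impossible value 400 falls outside the upper whisker and is dropped.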
Below you can see histogram of some of the data attributes. You can find the computation code here.
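The binning behind such a histogram can be sketched as follows (synthetic values; the project bins the real attribute and plots it):

```python
import numpy as np

# Made-up running times standing in for the real attribute values.
running_time = np.array([40, 75, 88, 90, 95, 96, 100, 100, 110, 167])

# Bin the values into 5 equal-width bins; a plotting library such as
# matplotlib would draw one bar per bin count.
counts, bin_edges = np.histogram(running_time, bins=5)
print(counts, bin_edges)
```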
After removing unnecessary attributes, the dissimilarity matrix has been computed. You can see and follow the computation steps [here](dissimilarity_matrix.ipynb).
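For nominal attributes, one common dissimilarity measure is simple matching: the fraction of attributes on which two records disagree. A minimal sketch on toy data (the column names are illustrative stand-ins, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

# Toy records with two nominal attributes.
df = pd.DataFrame({
    "Language": ["English", "English", "French"],
    "Country": ["US", "US", "France"],
})

n = len(df)
dissim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Proportion of attributes on which records i and j disagree.
        dissim[i, j] = (df.iloc[i] != df.iloc[j]).mean()
print(dissim)
```

The matrix is symmetric with a zero diagonal; identical records get dissimilarity 0, fully differing records get 1.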
The correlation between the dataset's attributes has been computed after some cleaning and normalization steps. Some unnecessary attributes have been ignored, and some nominal attributes have been converted to numerical values. The procedure and the final results can be found here.
Scatter plots for correlated attributes are shown below.
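A sketch of that encode-then-correlate step, on invented numbers rather than the real attribute values:

```python
import pandas as pd

# Illustrative columns; the project runs this on the real dataset.
df = pd.DataFrame({
    "Budget": [3e5, 5e6, 3e7, 1.5e8],
    "Box Office": [1e4, 1e7, 9e7, 5e8],
    "Language": ["English", "English", "French", "English"],
})

# Convert a nominal attribute to integer category codes so it can
# participate in the correlation matrix.
df["Language"] = df["Language"].astype("category").cat.codes

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(numeric_only=True)
print(corr)
```

Strongly correlated pairs (e.g. budget and box office in this toy example) are natural candidates for the scatter plots.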
After cleaning the dataset, which originally had 35 attributes (columns), we arrived at a dataset with only 17 attributes (columns).
Columns containing more than 40% missing or invalid values have been dropped.
The remaining columns' missing values have been imputed automatically using each attribute's median.
You can see the procedure and the results by executing the cleaningDataset.py file.
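The two cleaning rules above can be sketched as follows, on toy data (column names and values are illustrative, not the script's actual contents):

```python
import numpy as np
import pandas as pd

# Toy frame: one mostly complete column, one that is 80% missing.
df = pd.DataFrame({
    "Running Time": [96, np.nan, 100, 90, 110],
    "Narrated by": [np.nan, np.nan, np.nan, "X", np.nan],
})

# Rule 1: keep only columns whose missing ratio is at most 40%.
keep = df.columns[df.isna().mean() <= 0.40]
df = df[keep]

# Rule 2: impute remaining numeric gaps with each column's median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
print(df)
```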
Redundant data records have been handled or removed during the data cleaning process.
Some useful notes emerge from regularly inspecting the dataset. There are attributes with different labels that contain the same data values, whether identically or in substance. There are also attributes with more than 40% missing or invalid data.
All these cases have been addressed while cleaning the dataset.
The data have been normalized using min-max scaling, which maps values into the range [0,1]. Alongside normalization, z-score standardization has also been performed on the dataset. You can see and compare an attribute before and after the procedure below.
The results are accessible in min-max_normalization.py.
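Both scalings can be sketched on a single attribute (made-up values; the script applies this to the real columns):

```python
import pandas as pd

# Illustrative running times standing in for a real attribute.
s = pd.Series([40, 96, 100, 167], name="Running Time")

# Min-max scaling: maps the minimum to 0 and the maximum to 1.
min_max = (s - s.min()) / (s.max() - s.min())

# Z-score standardization: zero mean, unit (sample) standard deviation.
z_score = (s - s.mean()) / s.std()

print(min_max.tolist())
print(z_score.tolist())
```

Min-max preserves the shape of the distribution within a fixed range, while z-scores express each value as a number of standard deviations from the mean.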
Given the nature of the records in this dataset, it would simply be wrong to perform numerosity reduction on the data. Each record contains data about a single movie produced by Disney Pictures, and records are not normally related to one another. However, redundancy is not acceptable, and duplicate data records must be removed from the dataset. Outlier records have also been detected and removed. These problems have been solved during the data cleaning and outlier detection processes.
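The duplicate-removal part can be sketched as follows (toy records; titles and years are invented for illustration):

```python
import pandas as pd

# Toy frame with one exact duplicate record.
df = pd.DataFrame({
    "Title": ["Fantasia", "Fantasia", "Bambi"],
    "Release Year": [1940, 1940, 1942],
})

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()
print(deduped)
```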