## Netflix Movies and TV Shows
## Exploratory Data Analysis (EDA)
## Machine Learning Algorithms for Classification

In [1]:
import numpy as np              # One- and multi-dimensional arrays
import pandas as pd             # Tables and time series (DataFrame, Series)
import matplotlib.pyplot as plt # Scientific plotting
import seaborn as sns           # Higher-level statistical plots for data visualization
import sklearn                  # Machine learning algorithms

## 0. Problem description

We look at data on the TV shows and movies available on Netflix worldwide. First we carry out exploratory data analysis (EDA) to get to know and describe the data using interactive plots and visualizations. Then we build a classifier that predicts `type` (movie or TV show) from the other features.

### The dataset has 8807 rows and 12 columns (features):
 1. show_id - Unique identifier of each movie/TV show
 2. type - Whether the title is a movie or a TV show
 3. title - Title of the movie/TV show
 4. director - Director of the movie
 5. cast - Actors appearing in the movie/show
 6. country - Country where the movie/show was produced
 7. date_added - Date the title was added to Netflix
 8. release_year - Actual release year of the movie/show
 9. rating - TV rating of the movie/show
 10. duration - Total duration, in minutes or in number of seasons
 11. listed_in - Genres of the show
 12. description - Short text describing the show

## 1. Reading the data

Load the file from the internet:

In [2]:
url = "https://raw.githubusercontent.com/troshinvlaad/ML/main/netflix_titles.csv" 
data_raw = pd.read_csv(url)

In [3]:
data_raw.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
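Before going further, it can help to confirm that the loaded table matches the description in section 0. The sketch below is one possible check, not part of the original notebook: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names, and the one-row CSV is an in-memory stand-in (copied from the `head()` output) so that the example runs without network access.

```python
import io
import pandas as pd

# Column names as listed in the dataset description (section 0).
EXPECTED_COLUMNS = [
    "show_id", "type", "title", "director", "cast", "country",
    "date_added", "release_year", "rating", "duration",
    "listed_in", "description",
]

def check_schema(df, expected_columns=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to the expected list."""
    actual = list(df.columns)
    missing = [c for c in expected_columns if c not in actual]
    unexpected = [c for c in actual if c not in expected_columns]
    return missing, unexpected

# One-row stand-in for the real file; pd.read_csv accepts a URL, a path,
# or a file-like object such as this StringIO buffer.
sample_csv = io.StringIO(
    "show_id,type,title,director,cast,country,date_added,"
    "release_year,rating,duration,listed_in,description\n"
    "s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"
    '"September 25, 2021",2020,PG-13,90 min,Documentaries,'
    '"As her father nears the end of his life, filmm..."\n'
)
sample = pd.read_csv(sample_csv)

missing, unexpected = check_schema(sample)
print(sample.shape, missing, unexpected)
```

On the real table the same call would be `check_schema(data_raw)`, and `data_raw.shape` should report the 8807 rows and 12 columns stated above.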

In [4]:
data_raw.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [5]:
data_raw

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [6]:
data_raw.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


## Partially preprocess the data and identify the categorical features

Feature types:

- Qualitative (*categorical*, *factor*):
  - Unordered (*nominal*)
  - Ordered (*ordinal*)
- Quantitative (*numerical*):
  - *Continuous*
  - *Discrete*

*Binary* features (which take only two values) can be treated as nominal, ordinal, or discrete.
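The nominal/ordinal distinction maps onto the `ordered` flag of a pandas categorical. The following is a minimal sketch on toy values. The rating order used here is an illustrative assumption, not something the dataset defines.

```python
import pandas as pd

# Nominal: the categories carry no inherent order.
kind = pd.Series(["Movie", "TV Show", "Movie"], dtype="category")

# Ordinal: an explicit order is supplied.
# ASSUMPTION: this ordering of ratings is chosen only for illustration.
rating_order = ["G", "PG", "PG-13", "R"]
rating = pd.Series(
    pd.Categorical(["PG-13", "R", "PG"], categories=rating_order, ordered=True)
)

print(kind.cat.ordered)               # unordered categorical
print(rating.cat.ordered)             # ordered categorical
print(rating.min(), rating.max())     # min/max follow the declared order
print((rating >= "PG-13").tolist())   # comparisons also follow that order
```

Order-based operations such as `min`, `max`, and `>=` are only meaningful for the ordered variant. On an unordered categorical pandas raises a `TypeError` for them.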

In [7]:
data_raw.dtypes

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
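Two of the `object` columns, `date_added` and `duration`, hold structured values stored as text. The sketch below shows one way they could be parsed. It is not a step the notebook performs. The toy frame, the `str.strip()` guard against stray whitespace, and the output names `duration_value` and `duration_unit` are all assumptions made for this example.

```python
import pandas as pd

# Toy rows in the same textual formats as the head() output above.
df = pd.DataFrame({
    "date_added": ["September 25, 2021", "September 24, 2021", None],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# Parse "Month day, year" strings; values that do not parse become NaT.
added = pd.to_datetime(
    df["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Split duration into a number and a unit ("min", "Season" or "Seasons").
parts = df["duration"].str.extract(
    r"^(?P<duration_value>\d+)\s+(?P<duration_unit>min|Seasons?)$"
)
parts["duration_value"] = pd.to_numeric(parts["duration_value"])

print(added)
print(parts)
```

In the rows shown above, movies are measured in minutes and TV shows in seasons, so the parsed unit would likely reveal the target `type` directly if it were used as a feature.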

Column (feature) names can be obtained with `data_raw.columns`, as in cell `In [3]` above.

The columns `'show_id'`, `'type'`, `'title'`, `'director'`, `'cast'`, `'country'`, `'date_added'`, `'rating'`, `'duration'`, `'listed_in'` and `'description'` contain categorical values. For now they have the `object` dtype. We will convert them to the dedicated `category` dtype, which is designed for storing categorical values.

pandas provides two main classes: `Series` and `DataFrame`.

In [8]:
type(data_raw)

pandas.core.frame.DataFrame
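To make the relationship between the two classes concrete, here is a small sketch on a toy frame (the values are illustrative and not read from the dataset). Which class comes back depends on how the table is indexed.

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "release_year": [2020, 2021],
})

col = df["type"]      # a single column label gives a Series
sub = df[["type"]]    # a list of labels gives a DataFrame with one column
row = df.iloc[0]      # a single row gives a Series indexed by column names

print(type(col).__name__, type(sub).__name__, type(row).__name__)
```

This is why `data_raw['type']` in the cells that follow prints with a `Name: type` footer: it is a `Series`, not a one-column table.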

Converting the dtype of the `'type'` feature:

In [9]:
data_raw['type']

0         Movie
1       TV Show
2       TV Show
3       TV Show
4       TV Show
         ...   
8802      Movie
8803    TV Show
8804      Movie
8805      Movie
8806      Movie
Name: type, Length: 8807, dtype: object

In [10]:
data_raw['type'].dtype

dtype('O')

In [11]:
data_raw['type'] = data_raw['type'].astype('category')

In [12]:
data_raw['type'].dtype

CategoricalDtype(categories=['Movie', 'TV Show'], ordered=False)
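After the conversion, each value is stored as a small integer code that points into the list of categories. A minimal sketch on toy values (not the full column) shows how to inspect both through the `.cat` accessor:

```python
import pandas as pd

kind = pd.Series(["Movie", "TV Show", "TV Show", "Movie"]).astype("category")

# Categories inferred from string data are sorted lexicographically.
print(list(kind.cat.categories))

# Integer code of each row: its position within the categories.
print(kind.cat.codes.tolist())  # [0, 1, 1, 0]

# Number of rows per category.
print(kind.value_counts().to_dict())
```

The codes are a convenient starting point for encoding a binary target, although scikit-learn classifiers also accept string labels directly.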

Converting the dtype of the `'title'` feature:

In [13]:
data_raw['title']

0        Dick Johnson Is Dead
1               Blood & Water
2                   Ganglands
3       Jailbirds New Orleans
4                Kota Factory
                ...          
8802                   Zodiac
8803              Zombie Dumb
8804               Zombieland
8805                     Zoom
8806                   Zubaan
Name: title, Length: 8807, dtype: object

In [14]:
data_raw['title'].dtype

dtype('O')

In [15]:
data_raw['title'] = data_raw['title'].astype('category')

In [16]:
data_raw['title'].dtype

CategoricalDtype(categories=['#Alive', '#AnneFrank - Parallel Stories',
                  '#FriendButMarried', '#FriendButMarried 2', '#Roxy',
                  '#Rucker50', '#Selfie', '#Selfie 69', '#blackAF',
                  '#cats_the_mewvie',
                  ...
                  '​Goli Soda 2', '​Maj Rati ​​Keteki', '​Mayurakshi',
                  '​SAINT SEIYA: Knights of the Zodiac',
                  '​​Kuch Bheege Alfaaz', '忍者ハットリくん', '海的儿子', '마녀사냥',
                  '반드시 잡는다', '최강전사 미니특공대 : 영웅의 탄생'],
, ordered=False)
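The per-column steps above can also be done in one call. The sketch below assumes a toy frame and an illustrative column selection (`categorical_columns` is a hypothetical name). It converts several columns with `DataFrame.astype` and then counts the categories each one ends up with.

```python
import pandas as pd

# Toy frame with a few of the dataset's columns; values are illustrative.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "rating": ["PG-13", "TV-MA", "R", "PG-13"],
    "title": ["Zodiac", "Zombie Dumb", "Zombieland", "Zoom"],
    "release_year": [2007, 2018, 2009, 2006],
})

# ASSUMPTION: which columns to convert is a choice made for this example.
categorical_columns = ["type", "rating", "title"]
df = df.astype({name: "category" for name in categorical_columns})

# Number of distinct categories per converted column.
n_categories = {
    name: len(df[name].cat.categories) for name in categorical_columns
}

print(df.dtypes)
print(n_categories)
```

A column such as `title`, where nearly every value is distinct, ends up with about as many categories as rows, as the long `CategoricalDtype` printout above suggests. Counting categories this way helps judge whether a column behaves as a categorical feature in practice.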