# **DS Bootcamp. Team00**

## Team (TL: crissyro)
   - *Roman (crissyro) - Leader*
   - *Matvey (belbysha)*
   - *Michael (stannisl)*


## Data in ml-latest-small:

1. **ratings.csv:** userId, movieId, rating, timestamp

2. **movies.csv:** movieId, title, genres

3. **tags.csv:** userId, movieId, tag, timestamp

4. **links.csv:** movieId, imdbId, tmdbId
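
Because the analysis below works without pandas, files with these columns can be read using only the standard library. The following is a minimal sketch, assuming the column layout listed above; `load_rows` is a hypothetical helper and not part of `movielens_analysis`, and the sample values are illustrative.

```python
import csv
import io


def load_rows(file_obj):
    """Yield each CSV row as a dict keyed by the header names."""
    yield from csv.DictReader(file_obj)


# Tiny in-memory sample in the ratings.csv layout (values are illustrative).
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
)
rows = list(load_rows(sample))

# csv returns every field as a string, so numeric columns need converting.
first_rating = float(rows[0]["rating"])
```

With a real file, `open("ml-latest-small/ratings.csv", newline="")` would take the place of the `io.StringIO` object.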

## How to start the project
----------------------------------------------------------------

1. Clone this repo: `git clone https://github.com/DS-Bootcamp-Team00/movielens_analysis.git`

2. Create and activate a Python virtual environment:

```bash
python3 -m venv .env

source .env/bin/activate
```

3. Install necessary packages:

```bash 
pip install -r requirements.txt
```

4. Run tests:

```bash
pytest tests/tests_run.py
```

5. Optional: build the Docker image:

```bash
docker build -t movielens_analysis:latest .
```

Then run the container:

```bash
docker run -p 8888:8888 -v "$(pwd)":/app movielens_analysis:latest
```

## Project Structure

1. `movielens_analysis`: Main module with classes and functions for data analysis.

2. `tests`: Unit tests for the `movielens_analysis` module.

3. `requirements.txt`: List of required packages for the project.

4. `Dockerfile`: Dockerfile for building Docker image of the project.


In [44]:
# Start tests

!pytest tests/tests_run.py
! rm -rf __pycache__ .pytest_cache tests/__pycache__

platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/crissyro/school21/DS_Bootcamp.Team00-2/src
plugins: anyio-4.7.0
collected 26 items

tests/tests_run.py .......................... [100%]



In [6]:
from movielens_analysis import Ratings, calculate_mean, calculate_median, Links

In [7]:
RATINGS_PATH = "ml-latest-small/ratings.csv"
LINKS_PATH = "ml-latest-small/links.csv"
MOVIES_PATH = "ml-latest-small/movies.csv"
TAGS_PATH = "ml-latest-small/tags.csv"

ratings = Ratings(RATINGS_PATH)
movies = ratings.Movies(ratings.data)
users = ratings.Users(ratings.data)

## Let's look at the number of ratings per year
### No visualization libraries are available here, so we make do with what we have

#### Note that the ratings are unevenly distributed: the most ratings fall in **2000** and the fewest in **1998**.

In [10]:
movies.dist_by_year()

{1996: 6040,
 1997: 1916,
 1998: 507,
 1999: 2439,
 2000: 10061,
 2001: 3922,
 2002: 3478,
 2003: 4014,
 2004: 3279,
 2005: 5813,
 2006: 4059,
 2007: 7114,
 2008: 4351,
 2009: 4158,
 2010: 2300,
 2011: 1690,
 2012: 4657,
 2013: 1664,
 2014: 1439,
 2015: 6616,
 2016: 6702,
 2017: 8199,
 2018: 6418}

In [50]:
%timeit movies.dist_by_year()

117 ns ± 6.35 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
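
Without a plotting library, a distribution like this one can still be made easier to scan with a plain-text bar chart. This is a minimal sketch; `text_histogram` is a hypothetical helper, not something provided by `movielens_analysis`, and it assumes a non-empty dictionary with a positive maximum.

```python
def text_histogram(counts, width=40):
    """Render a {label: count} dict as horizontal bars of '#' characters."""
    peak = max(counts.values())
    lines = []
    for label, count in counts.items():
        # Scale each bar relative to the largest count.
        bar = "#" * round(count / peak * width)
        lines.append(f"{label} | {bar} {count}")
    return "\n".join(lines)


# Three entries taken from the dist_by_year() output above.
chart = text_histogram({1996: 6040, 1998: 507, 2000: 10061})
print(chart)
```

Passing the full dictionary returned by `movies.dist_by_year()` would draw one bar per year.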


### Now let's look at the distribution of ratings. Most raters give 4.0 or 3.0, which is not surprising

In [7]:
movies.dist_by_rating()

{0.5: 1370,
 1.0: 2811,
 1.5: 1791,
 2.0: 7551,
 2.5: 5550,
 3.0: 20047,
 3.5: 13136,
 4.0: 26818,
 4.5: 8551,
 5.0: 13211}

In [49]:
%timeit movies.dist_by_rating()

131 ns ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
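
The observation about 3.0 and 4.0 can be quantified directly from the dictionary. The sketch below copies the counts from the output above and derives the total number of ratings, the combined share of the two most common values, and the overall mean; the variable names are illustrative.

```python
# Counts copied from the dist_by_rating() output above.
dist = {0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
        3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211}

total = sum(dist.values())

# Fraction of all ratings that are exactly 3.0 or 4.0.
share_3_and_4 = (dist[3.0] + dist[4.0]) / total

# Mean rating reconstructed from the distribution (a weighted average).
mean_rating = sum(rating * count for rating, count in dist.items()) / total

print(total, round(share_3_and_4, 3), round(mean_rating, 2))
```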


# NO PANDAS - NO ANALYSIS
### Here are the top 5 movies by number of ratings


In [19]:
movies.top_by_num_of_ratings(5)

{356: 329, 318: 317, 296: 307, 593: 279, 2571: 278}

In [48]:
%timeit movies.top_by_num_of_ratings(5)

161 ns ± 28.9 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
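
A top-n ranking by number of ratings can be expressed in a few lines with `collections.Counter`. This is a sketch of the general technique, not the actual implementation of `top_by_num_of_ratings`; `top_by_count` and the sample IDs are hypothetical.

```python
from collections import Counter


def top_by_count(movie_ids, n):
    """Return the n most frequent movie IDs as {movieId: count}."""
    return dict(Counter(movie_ids).most_common(n))


# Illustrative movieId column: one entry per rating.
sample_ids = [356, 318, 356, 296, 318, 356]
top_two = top_by_count(sample_ids, 2)
```

Since dictionaries preserve insertion order, the result keeps the ranking from most to least rated.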


### Here are the top 5 movies by mean rating

In [20]:
movies.top_by_ratings(5, metric=calculate_mean)

{131724: 5.0, 5746: 5.0, 6835: 5.0, 3851: 5.0, 1151: 5.0}

In [47]:
%timeit movies.top_by_ratings(5, metric=calculate_mean)

297 ns ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


### Here are the top 5 movies by median rating

In [21]:
movies.top_by_ratings(5, metric=calculate_median)

{131724: 5.0, 5746: 5.0, 6835: 5.0, 70946: 5.0, 3851: 5.0}

In [46]:
%timeit movies.top_by_ratings(5, metric=calculate_median)

366 ns ± 68.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
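
Passing the aggregation function as `metric` lets the same ranking code serve both the mean and the median. The sketch below illustrates that design with the standard `statistics` functions standing in for `calculate_mean` and `calculate_median`; `top_by_metric` and the sample pairs are hypothetical, and the real method may differ.

```python
from collections import defaultdict
from statistics import mean, median


def top_by_metric(pairs, n, metric):
    """Group ratings by movie, score each group with `metric`, keep the top n."""
    by_movie = defaultdict(list)
    for movie_id, rating in pairs:
        by_movie[movie_id].append(rating)
    scores = {m: round(metric(r), 2) for m, r in by_movie.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Illustrative (movieId, rating) pairs.
pairs = [(1, 5.0), (2, 4.0), (2, 3.0), (3, 1.0), (3, 5.0), (3, 5.0)]
by_mean = top_by_metric(pairs, 2, mean)
by_median = top_by_metric(pairs, 2, median)
```

In this sample, movie 1 leads both rankings with a single 5.0 rating. Movies with very few ratings easily reach a perfect score, which is one plausible reason every entry in the outputs above is 5.0.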


### Here are the top 5 movies by rating variance

In [22]:
movies.top_controversial(5)

{32892: 10.12, 2068: 10.12, 484: 8.0, 3223: 8.0, 7564: 8.0}

In [45]:
%timeit movies.top_controversial(5)

164 ns ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
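
The top values, 10.12 and 8.0, are larger than any population variance possible on a 0.5 to 5.0 scale (at most 5.0625). They are consistent with the sample variance, which divides by n - 1, of a movie that has only two ratings at opposite ends of the scale. This is an inference from the numbers rather than a statement about how `top_controversial` is implemented; the check below uses the standard `statistics` module.

```python
from statistics import pvariance, variance

two_extremes = [0.5, 5.0]

sample_var = variance(two_extremes)       # divides by n - 1
population_var = pvariance(two_extremes)  # divides by n

# A pair of ratings 1.0 and 5.0 gives the second value seen in the output.
second_pair_var = variance([1.0, 5.0])
```

Here `round(sample_var, 2)` gives 10.12 and `second_pair_var` is 8.0, matching the first and third entries above.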


## Now let's analyze the users

### Users and the number of ratings each of them gave

In [23]:
users.dist_by_num_of_ratings()

{1: 232,
 2: 29,
 3: 39,
 4: 216,
 5: 44,
 6: 314,
 7: 152,
 8: 47,
 9: 46,
 10: 140,
 11: 64,
 12: 32,
 13: 31,
 14: 48,
 15: 135,
 16: 98,
 17: 105,
 18: 502,
 19: 703,
 20: 242,
 21: 443,
 22: 119,
 23: 121,
 24: 110,
 25: 26,
 26: 21,
 27: 135,
 28: 570,
 29: 81,
 30: 34,
 31: 50,
 32: 102,
 33: 156,
 34: 86,
 35: 23,
 36: 60,
 37: 21,
 38: 78,
 39: 100,
 40: 103,
 41: 217,
 42: 440,
 43: 114,
 44: 48,
 45: 399,
 46: 42,
 47: 140,
 48: 33,
 49: 21,
 50: 310,
 51: 359,
 52: 130,
 53: 20,
 54: 33,
 55: 25,
 56: 46,
 57: 476,
 58: 112,
 59: 107,
 60: 22,
 61: 39,
 62: 366,
 63: 271,
 64: 517,
 65: 34,
 66: 345,
 67: 36,
 68: 1260,
 69: 46,
 70: 62,
 71: 35,
 72: 45,
 73: 210,
 74: 177,
 75: 69,
 76: 119,
 77: 29,
 78: 61,
 79: 64,
 80: 167,
 81: 26,
 82: 227,
 83: 118,
 84: 293,
 85: 34,
 86: 70,
 87: 21,
 88: 56,
 89: 518,
 90: 54,
 91: 575,
 92: 24,
 93: 97,
 94: 56,
 95: 168,
 96: 78,
 97: 36,
 98: 92,
 99: 53,
 100: 148,
 101: 61,
 102: 56,
 103: 377,
 104: 273,
 105: 722,
 106: 3

In [44]:
%timeit users.dist_by_num_of_ratings()

140 ns ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


### Dictionary of users and their mean rating

In [24]:
users.dist_by_average_rating(metric=calculate_mean)

{1: 4.37,
 2: 3.95,
 3: 2.44,
 4: 3.56,
 5: 3.64,
 6: 3.49,
 7: 3.23,
 8: 3.57,
 9: 3.26,
 10: 3.28,
 11: 3.78,
 12: 4.39,
 13: 3.65,
 14: 3.4,
 15: 3.45,
 16: 3.72,
 17: 4.21,
 18: 3.73,
 19: 2.61,
 20: 3.59,
 21: 3.26,
 22: 2.57,
 23: 3.65,
 24: 3.65,
 25: 4.81,
 26: 3.24,
 27: 3.55,
 28: 3.02,
 29: 4.14,
 30: 4.74,
 31: 3.92,
 32: 3.75,
 33: 3.79,
 34: 3.42,
 35: 4.09,
 36: 2.63,
 37: 4.14,
 38: 3.22,
 39: 4.0,
 40: 3.77,
 41: 3.25,
 42: 3.57,
 43: 4.55,
 44: 3.35,
 45: 3.88,
 46: 4.0,
 47: 3.05,
 48: 4.03,
 49: 4.26,
 50: 2.78,
 51: 3.78,
 52: 4.48,
 53: 5.0,
 54: 3.03,
 55: 2.84,
 56: 3.8,
 57: 3.39,
 58: 3.9,
 59: 4.36,
 60: 3.73,
 61: 4.05,
 62: 4.08,
 63: 3.63,
 64: 3.77,
 65: 4.03,
 66: 4.02,
 67: 3.97,
 68: 3.23,
 69: 4.37,
 70: 4.32,
 71: 3.6,
 72: 4.16,
 73: 3.71,
 74: 4.27,
 75: 3.23,
 76: 3.08,
 77: 4.0,
 78: 3.16,
 79: 4.2,
 80: 4.26,
 81: 2.77,
 82: 3.38,
 83: 3.31,
 84: 3.69,
 85: 3.71,
 86: 3.93,
 87: 3.95,
 88: 4.04,
 89: 3.47,
 90: 4.07,
 91: 3.4,
 92: 3.94,
 93: 4.

In [43]:
%timeit users.dist_by_average_rating(metric=calculate_mean)

339 ns ± 35.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


### Dictionary of users and their median rating

In [25]:
users.dist_by_average_rating(metric=calculate_median)

{1: 5.0,
 2: 4.0,
 3: 0.5,
 4: 4.0,
 5: 4.0,
 6: 3.0,
 7: 3.5,
 8: 3.0,
 9: 3.0,
 10: 3.5,
 11: 4.0,
 12: 4.75,
 13: 4.0,
 14: 3.0,
 15: 3.5,
 16: 4.0,
 17: 4.0,
 18: 4.0,
 19: 3.0,
 20: 3.75,
 21: 3.5,
 22: 3.0,
 23: 3.5,
 24: 3.5,
 25: 5.0,
 26: 3.0,
 27: 4.0,
 28: 3.0,
 29: 4.0,
 30: 5.0,
 31: 4.0,
 32: 4.0,
 33: 4.0,
 34: 4.0,
 35: 4.0,
 36: 2.5,
 37: 4.0,
 38: 3.0,
 39: 4.0,
 40: 4.0,
 41: 3.5,
 42: 4.0,
 43: 5.0,
 44: 3.0,
 45: 4.0,
 46: 4.0,
 47: 3.0,
 48: 4.0,
 49: 4.5,
 50: 3.0,
 51: 4.0,
 52: 5.0,
 53: 5.0,
 54: 3.0,
 55: 3.0,
 56: 4.0,
 57: 4.0,
 58: 4.0,
 59: 5.0,
 60: 4.0,
 61: 4.0,
 62: 4.0,
 63: 4.0,
 64: 4.0,
 65: 4.0,
 66: 4.0,
 67: 4.0,
 68: 3.25,
 69: 5.0,
 70: 4.5,
 71: 4.0,
 72: 4.5,
 73: 4.0,
 74: 4.0,
 75: 4.0,
 76: 3.5,
 77: 5.0,
 78: 3.0,
 79: 4.0,
 80: 4.0,
 81: 3.0,
 82: 3.5,
 83: 3.5,
 84: 4.0,
 85: 4.0,
 86: 4.0,
 87: 4.0,
 88: 4.5,
 89: 3.5,
 90: 4.0,
 91: 3.5,
 92: 4.0,
 93: 4.0,
 94: 3.0,
 95: 4.0,
 96: 4.0,
 97: 4.0,
 98: 4.0,
 99: 4.0,
 100: 4.0,
 101:

In [41]:
%timeit users.dist_by_average_rating(metric=calculate_median)

256 ns ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


### Dictionary of the top 5 users by rating variance

In [26]:
users.top_by_variance(5)

{3: 4.37, 55: 3.22, 461: 3.22, 259: 3.05, 329: 3.05}

In [42]:
%timeit users.top_by_variance(5)

131 ns ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


## Now let's analyze the links: the Links module

In [27]:
links = Links(LINKS_PATH, 10)

## Links methods

* **get_imdb(self, list_of_movies: list, list_of_fields: list)** - Retrieves information about the movies from **IMDb** for the specified fields

* **top_directors(self, n)** - Returns the **n** most popular directors

* **most_expensive(self, n)** - Returns the **n** most expensive movies

* **most_profitable(self, n)** - Returns the **n** most profitable movies

* **longest(self, n)** - Returns the **n** longest movies

* **top_cost_per_minute(self, n)** - Returns the **n** movies with the highest cost per minute of runtime

* **top_languages(self, n)** - Returns the **n** most popular movie languages

In [28]:
list_movies = [1, 2]
list_fields = ['Title', 'Director', 'Budget', 'Gross worldwide', 'Runtime']

links.get_imdb(list_movies, list_fields)

[['0114709',
  'История игрушек',
  'John Lasseter',
  '$30,000,000 (estimated)',
  '$394,436,586',
  '1 hour 21 minutes'],
 ['0113497',
  'Джуманджи',
  'Joe Johnston',
  '$65,000,000 (estimated)',
  '$262,821,940',
  '1 hour 44 minutes']]

In [29]:
links.top_directors(3)

{'Forest Whitaker': 1, 'John Lasseter': 1, 'Peter Hyams': 1}

In [30]:
links.most_profitable(3)

{'История игрушек': 364436586, 'Джуманджи': 197821940, 'Схватка': 127436818}
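
The first two `most_profitable` values equal worldwide gross minus budget for the two movies whose fields appear in the `get_imdb` output. The sketch below reproduces that arithmetic from the raw strings; `parse_money` is a hypothetical helper, and treating profit as gross minus budget is an assumption inferred from the numbers.

```python
import re


def parse_money(text):
    """Extract the integer amount from strings like '$30,000,000 (estimated)'."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no amount found in {text!r}")
    return int(match.group().replace(",", ""))


# Budget and gross strings copied from the get_imdb() output above.
budget = parse_money("$30,000,000 (estimated)")
gross = parse_money("$394,436,586")
profit = gross - budget
```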

In [31]:
links.longest(3)

{'Схватка': 170, 'Сабрина': 127, 'В ожидании выдоха': 124}
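
The runtimes returned by `longest` are in minutes, while `get_imdb` returns strings such as `'1 hour 21 minutes'`. A conversion along the following lines would bridge the two. `runtime_minutes` is a hypothetical helper, and the cost-per-minute line assumes that `top_cost_per_minute` divides budget by runtime.

```python
import re


def runtime_minutes(text):
    """Convert strings like '1 hour 21 minutes' to a total number of minutes."""
    hours = re.search(r"(\d+)\s*hour", text)
    minutes = re.search(r"(\d+)\s*minute", text)
    return (int(hours.group(1)) if hours else 0) * 60 + (
        int(minutes.group(1)) if minutes else 0
    )


# Runtime string and budget taken from the get_imdb() output above.
toy_story = runtime_minutes("1 hour 21 minutes")
cost_per_minute = 30_000_000 / toy_story
```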

In [32]:
links.top_languages(3)

{'English': 4, 'French': 2, 'Spanish': 1}

In [33]:
%timeit links.get_imdb(list_movies, list_fields)

2.94 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]:
%timeit links.top_directors(3)

14.7 s ± 1.23 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [35]:
%timeit links.most_expensive(3)

15.6 s ± 7.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [36]:
%timeit links.most_profitable(3)

12.7 s ± 690 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [38]:
%timeit links.longest(3)

13.8 s ± 1.39 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [39]:
%timeit links.top_cost_per_minute(3)

12.4 s ± 548 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [40]:
%timeit links.top_languages(3)

13.2 s ± 875 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Moving on to the Movies module

In [2]:
from movielens_analysis import Movies

In [28]:
data_movies = Movies(MOVIES_PATH)

In [20]:
data_movies.dist_by_release()

{2002: 310,
 2006: 296,
 2001: 295,
 2000: 290,
 2007: 283,
 2009: 281,
 2003: 278,
 2014: 278,
 2004: 277,
 1996: 275,
 2015: 274,
 2005: 272,
 2008: 269,
 1999: 261,
 1995: 259,
 1997: 259,
 1998: 256,
 2011: 254,
 2010: 248,
 2013: 238,
 1994: 237,
 2012: 235,
 2016: 218,
 1993: 198,
 1992: 165,
 1988: 164,
 1987: 152,
 1990: 147,
 1991: 147,
 2017: 144,
 1989: 142,
 1986: 139,
 1985: 127,
 1984: 99,
 1981: 92,
 1980: 89,
 1982: 87,
 1983: 83,
 1979: 68,
 1977: 63,
 1973: 59,
 1978: 59,
 1965: 47,
 1971: 46,
 1974: 44,
 1964: 43,
 1976: 42,
 1967: 42,
 1975: 42,
 1966: 42,
 2018: 42,
 1968: 41,
 1962: 40,
 1972: 39,
 1963: 39,
 1959: 37,
 1960: 37,
 1969: 36,
 1955: 36,
 1961: 34,
 1970: 33,
 1957: 33,
 1958: 31,
 1953: 30,
 1956: 30,
 1940: 25,
 1949: 25,
 1954: 23,
 1942: 23,
 1939: 23,
 1946: 23,
 1951: 22,
 1950: 21,
 1947: 20,
 1948: 20,
 1941: 19,
 1936: 18,
 1945: 17,
 1937: 16,
 1952: 16,
 1944: 16,
 1938: 15,
 1931: 14,
 1935: 13,
 1933: 12,
 1934: 11,
 1943: 10,
 1932: 9,


In [21]:
%timeit data_movies.dist_by_release()

2.01 μs ± 363 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
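
MovieLens titles carry the release year in trailing parentheses, as in `Mulan (1998)`, so a distribution by release year can be built by parsing the title. The sketch below shows one way to do it; `release_year` is a hypothetical helper, the title `Untitled` is made up to show the missing-year case, and the actual `dist_by_release` may handle edge cases differently.

```python
import re
from collections import Counter

YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")


def release_year(title):
    """Return the year in a trailing '(YYYY)', or None when it is absent."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None


titles = ["Rubber (2010)", "Mulan (1998)", "Osmosis Jones (2001)", "Untitled"]
years = Counter(y for y in map(release_year, titles) if y is not None)

# most_common() orders the years by descending count, like the output above.
by_release = dict(years.most_common())
```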


In [24]:
data_movies.dist_by_genres()

{'Drama': 4361,
 'Comedy': 3756,
 'Thriller': 1894,
 'Action': 1828,
 'Romance': 1596,
 'Adventure': 1263,
 'Crime': 1199,
 'Sci-Fi': 980,
 'Horror': 978,
 'Fantasy': 779,
 'Children': 664,
 'Animation': 611,
 'Mystery': 573,
 'Documentary': 440,
 'War': 382,
 'Musical': 334,
 'Western': 167,
 'IMAX': 158,
 'Film-Noir': 87,
 '(no genres listed)': 34}

In [25]:
%timeit data_movies.dist_by_genres()

1.93 μs ± 222 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [29]:
data_movies.most_genres(5)

{'Rubber (2010)': 10,
 'Patlabor: The Movie (Kidô keisatsu patorebâ: The Movie) (1989)': 8,
 'Mulan (1998)': 7,
 'Who Framed Roger Rabbit? (1988)': 7,
 'Osmosis Jones (2001)': 7}

In [30]:
%timeit data_movies.most_genres(5)

687 ns ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
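
In `movies.csv`, the genres of a movie are stored as one pipe-separated string, so both the genre distribution and the per-movie genre count come down to splitting on `|`. The following sketch uses made-up rows and is not the module's own code.

```python
from collections import Counter

# Illustrative rows in the movies.csv layout: (title, pipe-separated genres).
movies = [
    ("Movie A (2000)", "Comedy|Drama"),
    ("Movie B (2001)", "Drama"),
    ("Movie C (2002)", "Action|Comedy|Drama"),
]

# How many movies list each genre.
genre_counts = Counter(g for _, genres in movies for g in genres.split("|"))

# How many genres each movie lists.
genres_per_movie = {title: len(genres.split("|")) for title, genres in movies}
```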


## Using the Tags module

In [31]:
from movielens_analysis import Tags

data_tags = Tags(TAGS_PATH)

In [32]:
data_tags.longest(5)

['Something for everyone in this one... saw it without and plan on seeing it with kids!',
 'the catholic church is the most corrupt organization in history',
 'villain nonexistent or not needed for good story',
 'r:disturbing violent content including rape',
 '06 Oscar Nominated Best Movie - Animation']

In [33]:
%timeit data_tags.longest(5)

2.67 ms ± 323 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [35]:
data_tags.most_words(5)

{'Something for everyone in this one... saw it without and plan on seeing it with kids!': 16,
 'the catholic church is the most corrupt organization in history': 10,
 'villain nonexistent or not needed for good story': 8,
 '06 Oscar Nominated Best Movie - Animation': 7,
 'It was melodramatic and kind of dumb': 7}

In [36]:
%timeit data_tags.most_words(5)

3.09 ms ± 583 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [37]:
data_tags.most_words_and_longest(5)

['06 Oscar Nominated Best Movie - Animation',
 'the catholic church is the most corrupt organization in history',
 'Something for everyone in this one... saw it without and plan on seeing it with kids!',
 'villain nonexistent or not needed for good story']

In [38]:
%timeit data_tags.most_words_and_longest(5)

5.29 ms ± 666 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
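
The call above returns four tags although `n` is 5. That is what an intersection of the two top-5 lists shown earlier would give, since `longest(5)` and `most_words(5)` share exactly four tags. The sketch below illustrates that reading with hypothetical helpers and sample tags; it is an assumption about the behavior, not the module's code.

```python
def top_longest(tags, n):
    """The n unique tags with the most characters."""
    return sorted(set(tags), key=len, reverse=True)[:n]


def top_most_words(tags, n):
    """The n unique tags with the most whitespace-separated words."""
    return sorted(set(tags), key=lambda t: len(t.split()), reverse=True)[:n]


def in_both(tags, n):
    """Tags that appear in both top-n lists; may hold fewer than n items."""
    return sorted(set(top_longest(tags, n)) & set(top_most_words(tags, n)))


tags = ["a b c d", "extraordinarily", "x y z", "ok"]
both = in_both(tags, 2)
```

In the sample, only `"a b c d"` is in both top-2 lists, so the result has one item instead of two.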


In [40]:
data_tags.most_popular(5)

{'In Netflix queue': 131,
 'atmospheric': 36,
 'superhero': 24,
 'thought-provoking': 24,
 'funny': 23}

In [41]:
%timeit data_tags.most_popular(5)

2.24 ms ± 113 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [42]:
data_tags.tags_with('I')

['I am your father', 'I see dead people', 'World War I']

In [43]:
%timeit data_tags.tags_with('I')

2.15 ms ± 217 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
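
The result of `tags_with('I')` contains only tags in which `I` stands alone as a word, and a tag such as `In Netflix queue` is absent, which suggests case-sensitive whole-word matching. A sketch under that assumption follows; `tags_with_word` and the sample list are hypothetical.

```python
def tags_with_word(tags, word):
    """Unique tags containing `word` as a whole, case-sensitive word, sorted."""
    return sorted({tag for tag in tags if word in tag.split()})


sample_tags = ["World War I", "In Netflix queue", "I see dead people", "IMAX"]
matches = tags_with_word(sample_tags, "I")
```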
