# Project: YouTube trending videos

## Stage 1 - Text attributes

### Introduction

The goal of the project is to carry out a **knowledge discovery** process on data about **YouTube videos**.
The task is to identify which attributes an uploaded video should have in order to appear on the **`"trending"` tab**.

### Brief dataset specification

Two datasets are available ([download here](https://www.cs.put.poznan.pl/kmiazga/students/ped/youtube_data.zip)):
* **GB_videos_5p.csv** (`38916 rows`, `52 879 KB`)
* **US_videos_5p.csv** (`40949 rows`, `61 654 KB`)

The remainder of this report presents detailed statistics and results obtained by processing the text attributes from the following columns:

|video_id|trending_date|title|publish_time|tags|views|likes|
|-|-|-|-|-|-|-|

|dislikes|comment_count|thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|description|
|-|-|-|-|-|-|-|
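
As a rough illustration of how the columns listed above could be read, the sketch below parses a CSV stream into dictionaries keyed by column name. The helper `read_rows`, the comma delimiter, and the sample values are assumptions made for this example; the real files may use a different delimiter or encoding.

``` python
import csv
import io


def read_rows(handle, delimiter=","):
    """Yield each CSV row as a dict keyed by the header names.

    The delimiter is an assumption; adjust it to match the actual files.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        yield row


# Tiny in-memory stand-in for a real file (made-up values, not project data).
sample = io.StringIO(
    "video_id,title,tags,description\n"
    'abc123,Example title,"tag one|tag two",Some description\n'
)
rows = list(read_rows(sample))
print(rows[0]["title"])
```

For a real run, the same function could be given a file opened with `open("data/GB_videos_5p.csv", newline="", encoding="utf-8")`, assuming that path and encoding apply.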

Results for the visual attributes will be presented in the next report.

### Project structure

The project is written in `Python 3.7`.
* The libraries used are listed in `youtube-project\requirements.txt` and `youtube-project\Dockerfile`.
* Instructions for running the project are in `youtube-project\readme.md`.

**The most important code** for this stage is in **`youtube-project\src\analyzers\text_analyzers.py`**.  
It contains the *Analyzers* responsible for processing the text attributes.  
Each of them is documented in that file.
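
To give an idea of what such an analyzer might do, here is a minimal sketch of a case classifier that sorts strings into `Lower`, `Upper`, `Title-like` and `Mixed`. This is not the project's implementation: the function names `classify_case` and `case_counts`, the classification rules, and the sample strings are all assumptions chosen for illustration.

``` python
from collections import Counter


def classify_case(text):
    """Assign a string to one of four case categories (hypothetical rules)."""
    if text.isupper():
        return "Upper"
    if text.islower():
        return "Lower"
    if text.istitle():
        return "Title-like"
    return "Mixed"


def case_counts(values):
    """Count how many values fall into each case category."""
    return dict(Counter(classify_case(v) for v in values))


# Made-up sample values, not taken from the datasets.
sample_titles = [
    "OFFICIAL TRAILER",
    "my first vlog",
    "Late Night Show",
    "iPhone review 2018",
]
print(case_counts(sample_titles))
```

Under these rules a string such as `False` is classified as `Title-like`, and a timestamp containing only the uppercase letters `T` and `Z` is classified as `Upper`, because Python's string methods ignore digits and punctuation when checking case.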

### Analysis results

Running the *Analyzers* produced the following `JSON` file:

<details>
<summary>JSON</summary>
    
``` json
{
    "data/GB_videos_5p.csv": {
        "BooleanAnalyzer": {
            "comments_disabled": 683,
            "ratings_disabled": 272,
            "video_error_or_removed": 69
        },
        "CaseAnalyzer": {
            "channel_title": {
                "Lower": 2103,
                "Mixed": 3710,
                "Title-like": 29159,
                "Upper": 3944
            },
            "comments_disabled": {
                "Title-like": 38916
            },
            "description": {
                "Lower": 84,
                "Mixed": 38046,
                "Title-like": 137,
                "Upper": 30
            },
            "publish_time": {
                "Upper": 38916
            },
            "ratings_disabled": {
                "Title-like": 38916
            },
            "tags": {
                "Lower": 12282,
                "Mixed": 20156,
                "Title-like": 5996,
                "Upper": 482
            },
            "title": {
                "Lower": 290,
                "Mixed": 33250,
                "Title-like": 3125,
                "Upper": 2251
            },
            "video_error_or_removed": {
                "Title-like": 38916
            }
        },
        "CommonWordsAnalyzer": {
            "channel_title": {
                "bbc": 423,
                "entertain": 506,
                "jimmi": 415,
                "late": 787,
                "live": 554,
                "music": 318,
                "new": 481,
                "night": 400,
                "offici": 283,
                "pictur": 471,
                "show": 681,
                "star": 474,
                "the": 2024,
                "trailer": 389,
                "with": 820
            },
            "comments_disabled": {
                "fals": 38233,
                "true": 683
            },
            "description": {
                "-": 58196,
                "I": 25404,
                "a": 45137,
                "and": 70283,
                "by": 24754,
                "for": 34742,
                "in": 33862,
                "is": 23634,
                "of": 49237,
                "on": 52161,
                "the": 129331,
                "to": 63936,
                "video": 18517,
                "with": 26851,
                "you": 27612
            },
            "publish_time": {
                "2018-02-04t20:27:38.000z": 38,
                "2018-02-04t23:28:16.000z": 38,
                "2018-02-05t01:51:53.000z": 38,
                "2018-02-08t02:00:01.000z": 40,
                "2018-02-08t14:00:00.000z": 37,
                "2018-02-08t14:00:04.000z": 37,
                "2018-02-08t21:00:01.000z": 37,
                "2018-02-15t03:34:44.000z": 37,
                "2018-02-19t01:37:11.000z": 37,
                "2018-02-20t13:51:08.000z": 37,
                "2018-02-23t05:00:01.000z": 53,
                "2018-03-09t05:00:03.000z": 55,
                "2018-03-30t04:00:02.000z": 52,
                "2018-04-25t16:00:11.000z": 38,
                "2018-05-08t11:05:08.000z": 38
            },
            "ratings_disabled": {
                "fals": 38644,
                "true": 272
            },
            "tags": {
                "&": 2258,
                "[none]": 2010,
                "a": 2059,
                "and": 3928,
                "for": 1982,
                "in": 2100,
                "last": 2284,
                "music": 2509,
                "new": 2621,
                "of": 4826,
                "offici": 2230,
                "the": 11593,
                "to": 3362,
                "war": 2474,
                "you": 2154
            },
            "title": {
                "&": 3052,
                "(offici": 3587,
                "-": 17845,
                "A": 1682,
                "and": 2071,
                "ft.": 2247,
                "in": 1682,
                "music": 1747,
                "of": 1975,
                "offici": 2154,
                "the": 9296,
                "trailer": 2710,
                "video)": 3881,
                "with": 2093,
                "|": 8728
            },
            "video_error_or_removed": {
                "fals": 38847,
                "true": 69
            }
        },
        "DayAnalyzer": {
            "publish_time": {
                "Friday": 7634,
                "Monday": 5821,
                "Saturday": 2202,
                "Sunday": 2815,
                "Thursday": 7159,
                "Tuesday": 5952,
                "Wednesday": 7333
            }
        },
        "DigitsAnalyzer": {
            "channel_title": 529,
            "comments_disabled": 0,
            "description": 346,
            "publish_time": 38916,
            "ratings_disabled": 0,
            "tags": 482,
            "title": 718,
            "video_error_or_removed": 0
        },
        "HourAnalyzer": {
            "publish_time": {
                "0": 1140,
                "12": 1355,
                "13": 1968,
                "14": 2372,
                "15": 3014,
                "16": 3231,
                "17": 3410,
                "18": 2281,
                "19": 2052,
                "20": 1555,
                "21": 1699,
                "22": 1513,
                "4": 1808,
                "5": 2147,
                "9": 1374
            }
        },
        "HyperlinkAnalyzer": {
            "channel_title": 0,
            "comments_disabled": 0,
            "description": 383,
            "publish_time": 0,
            "ratings_disabled": 0,
            "tags": 0,
            "title": 0,
            "video_error_or_removed": 0
        },
        "LongTextLettersAnalyzer": {
            "channel_title": {
                "0": 10719,
                "1": 24003,
                "2": 3077,
                "3": 1099,
                "4": 18
            },
            "comments_disabled": {
                "0": 38916
            },
            "description": {
                "18": 385,
                "19": 359,
                "33": 431,
                "34": 442,
                "35": 380,
                "37": 423,
                "38": 493,
                "41": 416,
                "44": 410,
                "46": 533,
                "47": 394,
                "49": 375,
                "57": 398,
                "8": 369,
                "9": 449
            },
            "publish_time": {
                "2": 38916
            },
            "ratings_disabled": {
                "0": 38916
            },
            "tags": {
                "0": 2048,
                "10": 1162,
                "12": 824,
                "13": 984,
                "16": 830,
                "17": 977,
                "4": 951,
                "45": 829,
                "47": 1030,
                "49": 903,
                "5": 1270,
                "50": 1238,
                "6": 1861,
                "7": 1327,
                "8": 1212
            },
            "title": {
                "0": 57,
                "1": 1558,
                "10": 58,
                "2": 4265,
                "3": 7808,
                "4": 7389,
                "5": 6556,
                "6": 4347,
                "7": 3449,
                "8": 1887,
                "9": 1542
            },
            "video_error_or_removed": {
                "0": 38916
            }
        },
        "LongTextWordsAnalyzer": {
            "channel_title": {
                "0": 37354,
                "1": 1562
            },
            "comments_disabled": {
                "0": 38916
            },
            "description": {
                "1": 1014,
                "10": 1323,
                "11": 1156,
                "12": 838,
                "13": 807,
                "14": 953,
                "15": 796,
                "2": 1821,
                "3": 2060,
                "4": 2033,
                "5": 2022,
                "6": 1997,
                "7": 2114,
                "8": 1535,
                "9": 1642
            },
            "publish_time": {
                "0": 38916
            },
            "ratings_disabled": {
                "0": 38916
            },
            "tags": {
                "0": 9701,
                "1": 7060,
                "10": 432,
                "11": 325,
                "12": 189,
                "13": 61,
                "15": 49,
                "2": 5411,
                "3": 3537,
                "4": 2745,
                "5": 2584,
                "6": 2759,
                "7": 1818,
                "8": 1283,
                "9": 934
            },
            "title": {
                "0": 3762,
                "1": 20825,
                "2": 11352,
                "3": 2913,
                "4": 64
            },
            "video_error_or_removed": {
                "0": 38916
            }
        },
        "PartsOfSpeechAnalyzer": {
            "description": {
                ",": 177326,
                ".": 143755,
                ":": 642845,
                "CC": 107155,
                "DT": 229325,
                "IN": 347403,
                "JJ": 353866,
                "NN": 932501,
                "NNP": 1197322,
                "NNS": 151407,
                "PRP": 135909,
                "RB": 103693,
                "VB": 125917,
                "VBP": 99783,
                "VBZ": 81914
            },
            "tags": {
                "''": 1322421,
                "CC": 14282,
                "CD": 28655,
                "DT": 29191,
                "IN": 24344,
                "JJ": 92446,
                "NN": 1175659,
                "NNP": 359908,
                "NNS": 79038,
                "PRP": 13120,
                "RB": 10025,
                "VB": 15263,
                "VBD": 16065,
                "VBG": 13240,
                "VBP": 12571
            },
            "title": {
                "(": 12772,
                ")": 12742,
                ",": 6048,
                ".": 7304,
                ":": 23482,
                "CC": 5501,
                "CD": 11655,
                "DT": 13067,
                "IN": 15977,
                "JJ": 10199,
                "NN": 25093,
                "NNP": 211971,
                "NNS": 5665,
                "PRP": 4854,
                "VBP": 4615
            }
        },
        "SpecialCharsAnalyzer": {
            "channel_title": 228,
            "comments_disabled": 0,
            "description": 2623,
            "publish_time": 0,
            "ratings_disabled": 0,
            "tags": 2701,
            "title": 1356,
            "video_error_or_removed": 0
        }
    },
    "data/US_videos_5p.csv": {
        "BooleanAnalyzer": {
            "comments_disabled": 633,
            "ratings_disabled": 169,
            "video_error_or_removed": 23
        },
        "CaseAnalyzer": {
            "channel_title": {
                "Lower": 1806,
                "Mixed": 3532,
                "Title-like": 32033,
                "Upper": 3578
            },
            "comments_disabled": {
                "Title-like": 40949
            },
            "description": {
                "Lower": 124,
                "Mixed": 40163,
                "Title-like": 51,
                "Upper": 33
            },
            "publish_time": {
                "Upper": 40949
            },
            "ratings_disabled": {
                "Title-like": 40949
            },
            "tags": {
                "Lower": 16052,
                "Mixed": 20611,
                "Title-like": 4130,
                "Upper": 156
            },
            "title": {
                "Lower": 361,
                "Mixed": 32789,
                "Title-like": 5197,
                "Upper": 2602
            },
            "video_error_or_removed": {
                "Title-like": 40949
            }
        },
        "CommonWordsAnalyzer": {
            "channel_title": {
                "entertain": 454,
                "fox": 351,
                "insid": 415,
                "jimmi": 394,
                "late": 698,
                "live": 522,
                "morn": 337,
                "new": 1010,
                "night": 358,
                "of": 337,
                "pictur": 442,
                "show": 805,
                "star": 364,
                "the": 2829,
                "with": 925
            },
            "comments_disabled": {
                "fals": 40316,
                "true": 633
            },
            "description": {
                "-": 75985,
                "I": 34429,
                "a": 58593,
                "and": 100542,
                "by": 24642,
                "for": 49457,
                "in": 40448,
                "is": 29260,
                "of": 61526,
                "on": 64680,
                "the": 162035,
                "thi": 27115,
                "to": 89089,
                "with": 32641,
                "you": 41424
            },
            "publish_time": {
                "2018-05-06t13:00:05.000z": 32,
                "2018-05-09t17:00:00.000z": 29,
                "2018-05-10t16:00:11.000z": 28,
                "2018-05-10t17:00:01.000z": 28,
                "2018-05-10t17:02:55.000z": 28,
                "2018-05-10t19:56:28.000z": 28,
                "2018-05-11t04:00:34.000z": 29,
                "2018-05-11t21:11:16.000z": 28,
                "2018-05-13t18:03:56.000z": 30,
                "2018-05-13t19:00:25.000z": 29,
                "2018-05-14t13:00:01.000z": 29,
                "2018-05-14t14:00:03.000z": 29,
                "2018-05-14t15:59:47.000z": 29,
                "2018-05-14t19:00:01.000z": 29,
                "2018-05-18t14:00:04.000z": 50
            },
            "ratings_disabled": {
                "fals": 40780,
                "true": 169
            },
            "tags": {
                "a": 4347,
                "and": 6280,
                "for": 2173,
                "in": 3068,
                "make": 1543,
                "music": 1621,
                "my": 1720,
                "new": 1956,
                "of": 6008,
                "on": 2173,
                "the": 9438,
                "to": 6896,
                "vs": 2089,
                "with": 2462,
                "you": 1591
            },
            "title": {
                "&": 2024,
                "-": 11452,
                "A": 2122,
                "I": 1940,
                "a": 2566,
                "and": 2299,
                "how": 1817,
                "in": 2176,
                "of": 2338,
                "the": 9943,
                "to": 2343,
                "trailer": 2202,
                "video)": 1939,
                "with": 2759,
                "|": 10663
            },
            "video_error_or_removed": {
                "fals": 40926,
                "true": 23
            }
        },
        "DayAnalyzer": {
            "publish_time": {
                "Friday": 7002,
                "Monday": 6177,
                "Saturday": 3593,
                "Sunday": 3679,
                "Thursday": 6950,
                "Tuesday": 6786,
                "Wednesday": 6762
            }
        },
        "DigitsAnalyzer": {
            "channel_title": 330,
            "comments_disabled": 0,
            "description": 225,
            "publish_time": 40949,
            "ratings_disabled": 0,
            "tags": 418,
            "title": 1377,
            "video_error_or_removed": 0
        },
        "HourAnalyzer": {
            "publish_time": {
                "0": 1436,
                "1": 1318,
                "12": 1551,
                "13": 2105,
                "14": 2807,
                "15": 3483,
                "16": 3669,
                "17": 3447,
                "18": 2889,
                "19": 2132,
                "20": 2136,
                "21": 2104,
                "22": 1959,
                "23": 1495,
                "4": 1262
            }
        },
        "HyperlinkAnalyzer": {
            "channel_title": 0,
            "comments_disabled": 0,
            "description": 290,
            "publish_time": 0,
            "ratings_disabled": 0,
            "tags": 8,
            "title": 0,
            "video_error_or_removed": 0
        },
        "LongTextLettersAnalyzer": {
            "channel_title": {
                "0": 11893,
                "1": 24935,
                "2": 3060,
                "3": 1036,
                "4": 25
            },
            "comments_disabled": {
                "0": 40949
            },
            "description": {
                "110": 322,
                "14": 350,
                "15": 322,
                "16": 315,
                "28": 364,
                "34": 400,
                "40": 310,
                "42": 309,
                "43": 342,
                "47": 309,
                "51": 321,
                "54": 444,
                "64": 307,
                "69": 336,
                "83": 307
            },
            "publish_time": {
                "2": 40949
            },
            "ratings_disabled": {
                "0": 40949
            },
            "tags": {
                "0": 1658,
                "10": 1001,
                "13": 927,
                "45": 929,
                "46": 1015,
                "47": 1033,
                "48": 904,
                "49": 1312,
                "5": 944,
                "50": 1637,
                "51": 1125,
                "6": 1060,
                "7": 981,
                "8": 1005,
                "9": 928
            },
            "title": {
                "0": 147,
                "1": 1806,
                "10": 123,
                "2": 4995,
                "3": 7653,
                "4": 8433,
                "5": 7086,
                "6": 4429,
                "7": 3058,
                "8": 1775,
                "9": 1444
            },
            "video_error_or_removed": {
                "0": 40949
            }
        },
        "LongTextWordsAnalyzer": {
            "channel_title": {
                "0": 39434,
                "1": 1515
            },
            "comments_disabled": {
                "0": 40949
            },
            "description": {
                "10": 1063,
                "11": 1047,
                "12": 1272,
                "13": 1028,
                "14": 1011,
                "15": 995,
                "16": 1011,
                "2": 1114,
                "3": 1287,
                "4": 1201,
                "5": 1372,
                "6": 1360,
                "7": 1367,
                "8": 1274,
                "9": 1102
            },
            "publish_time": {
                "0": 40949
            },
            "ratings_disabled": {
                "0": 40949
            },
            "tags": {
                "0": 8169,
                "1": 7104,
                "10": 412,
                "11": 303,
                "12": 107,
                "13": 25,
                "14": 14,
                "2": 5361,
                "3": 4135,
                "4": 3812,
                "5": 3136,
                "6": 3106,
                "7": 2442,
                "8": 1764,
                "9": 1042
            },
            "title": {
                "0": 4388,
                "1": 22373,
                "2": 11453,
                "3": 2627,
                "4": 108
            },
            "video_error_or_removed": {
                "0": 40949
            }
        },
        "PartsOfSpeechAnalyzer": {
            "description": {
                ",": 229922,
                ".": 235572,
                ":": 805677,
                "CC": 146079,
                "DT": 299314,
                "IN": 450947,
                "JJ": 440590,
                "NN": 1125464,
                "NNP": 1646949,
                "NNS": 214560,
                "PRP": 190458,
                "RB": 131780,
                "VB": 160666,
                "VBP": 121493,
                "VBZ": 96851
            },
            "tags": {
                "''": 1533493,
                "CC": 15930,
                "CD": 30232,
                "DT": 30219,
                "IN": 28639,
                "JJ": 124164,
                "NN": 1425013,
                "NNP": 282902,
                "NNS": 118488,
                "PRP": 12214,
                "RB": 14507,
                "VB": 20114,
                "VBD": 18722,
                "VBG": 26335,
                "VBP": 14982
            },
            "title": {
                "(": 7117,
                ")": 7125,
                ".": 10211,
                ":": 16393,
                "CD": 12059,
                "DT": 15663,
                "IN": 19439,
                "JJ": 11882,
                "NN": 23541,
                "NNP": 200445,
                "NNS": 7625,
                "PRP": 6432,
                "VB": 4968,
                "VBP": 4907,
                "VBZ": 5131
            }
        },
        "SpecialCharsAnalyzer": {
            "channel_title": 70,
            "comments_disabled": 0,
            "description": 1624,
            "publish_time": 0,
            "ratings_disabled": 0,
            "tags": 1769,
            "title": 882,
            "video_error_or_removed": 0
        }
    }
}
```
    
</details>
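
The raw counts above are easier to compare between the two datasets once they are converted to shares. The sketch below does this for an excerpt of the report (the `BooleanAnalyzer` and `DayAnalyzer` entries, copied from the JSON above). The helper `shares` and the variable names are assumptions made for this example.

``` python
import json

# Excerpt of the report above, limited to two analyzers to keep the example short.
REPORT = json.loads("""
{
  "data/GB_videos_5p.csv": {
    "BooleanAnalyzer": {"comments_disabled": 683, "ratings_disabled": 272,
                        "video_error_or_removed": 69},
    "DayAnalyzer": {"publish_time": {"Friday": 7634, "Monday": 5821,
                    "Saturday": 2202, "Sunday": 2815, "Thursday": 7159,
                    "Tuesday": 5952, "Wednesday": 7333}}
  },
  "data/US_videos_5p.csv": {
    "BooleanAnalyzer": {"comments_disabled": 633, "ratings_disabled": 169,
                        "video_error_or_removed": 23},
    "DayAnalyzer": {"publish_time": {"Friday": 7002, "Monday": 6177,
                    "Saturday": 3593, "Sunday": 3679, "Thursday": 6950,
                    "Tuesday": 6786, "Wednesday": 6762}}
  }
}
""")


def shares(counts):
    """Convert a mapping of counts into a mapping of fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


weekend_share = {}
comments_disabled_share = {}
for path, analyzers in REPORT.items():
    days = analyzers["DayAnalyzer"]["publish_time"]
    day_shares = shares(days)
    weekend_share[path] = day_shares["Saturday"] + day_shares["Sunday"]
    # Every row has a publish day, so the day counts sum to the row count.
    row_count = sum(days.values())
    comments_disabled_share[path] = (
        analyzers["BooleanAnalyzer"]["comments_disabled"] / row_count
    )
    print(f"{path}: weekend publish share = {weekend_share[path]:.1%}, "
          f"comments disabled = {comments_disabled_share[path]:.2%}")
```

Normalizing by the row count matters here because the US file is larger than the GB file, so absolute counts alone can be misleading.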

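``` python
import json

import chart_studio.plotly as py
import plotly.figure_factory as ff

# Loading of the generated report is currently commented out:
# with open('text-analysis.json') as json_file:
#     data = json.load(json_file)
```

Running this cell failed with `ModuleNotFoundError: No module named 'chart_studio'`.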