Machine Learning

Coursework

Author: Vera Ippo, group P4117


Task: given a list of ingredients, determine which cuisine a recipe belongs to. https://www.kaggle.com/c/whats-cooking  
Input data: a set of recipes in JSON format.  
Example:  
 {
 "id": 24717,
 "cuisine": "indian",
 "ingredients": [
     "tumeric",
     "vegetable stock",
     "tomatoes",
     "garam masala",
     "naan",
     "red lentils",
     "red chili peppers",
     "onions",
     "spinach",
     "sweet potatoes"
 ]
 },


To weigh the importance of ingredient words in recipes, this work uses the TF-IDF measure (TF: term frequency, IDF: inverse document frequency). It is the product of two quantities: the number of occurrences of a word relative to the total number of words in a document (here, a recipe), and the inverse of the word's frequency across the documents of the collection (here, all recipes).
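
As a minimal sketch of the measure described above, the following computes plain textbook TF-IDF by hand for three toy recipes. The recipes and the helper name `tf_idf` are illustrative assumptions, not part of the dataset or the notebook. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers differ from this plain formula.

```python
import math

# Toy corpus: each recipe is a list of ingredient tokens (made up for illustration).
recipes = [
    ["onions", "tomatoes", "garam_masala"],
    ["onions", "tomatoes", "basil"],
    ["onions", "soy_sauce", "ginger"],
]

def tf_idf(term, recipe, corpus):
    """Plain TF-IDF: (count of term / tokens in recipe) * log(N / recipes containing term)."""
    tf = recipe.count(term) / len(recipe)
    containing = sum(1 for r in corpus if term in r)
    idf = math.log(len(corpus) / containing)
    return tf * idf

# "onions" occurs in every recipe, so its weight is zero;
# "garam_masala" occurs in only one recipe, so it gets the largest weight.
weights = {term: tf_idf(term, recipes[0], recipes) for term in recipes[0]}
print(weights)
```

An ingredient that appears everywhere carries no information about the cuisine, which is why the IDF factor drives its weight to zero.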

In [1]:
import pandas
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

In [2]:
# load the dataset
def load_dataset(filename):
    dataset = pandas.read_json(filename)
    return dataset

In [3]:
# split into training and test sets
def split_dataset(dataset_df, test_size):
    ingredients = dataset_df['ingredients'].values  # ingredient lists (features), one per recipe
    array_of_ingredients = make_array_of_strings(ingredients)
    class_cuisine = dataset_df['cuisine'].values  # cuisine labels
    data_train, data_test, class_train, class_test = train_test_split(array_of_ingredients, class_cuisine,
                                                                      test_size=test_size)
    return data_train, class_train, data_test, class_test

In [4]:
def make_array_of_strings(recipes):
    # Turn each ingredient list into one space-separated string; spaces inside
    # an ingredient become underscores so that it stays a single token.
    array_of_ingredients = []
    for ingredients in recipes:
        items = ["_".join(item.split(" ")) for item in ingredients]
        array_of_ingredients.append(" ".join(items))
    return array_of_ingredients
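
The underscore join matters because the default token pattern of scikit-learn's text vectorizers, `(?u)\b\w\w+\b`, treats `_` as a word character. A small sketch of the effect, applying that pattern directly with `re` to one ingredient from the example above (the constant name `TOKEN_PATTERN` is an assumption made for this illustration):

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

ingredient = "vegetable stock"
joined = "_".join(ingredient.split(" "))

# Without the join the ingredient falls apart into two unrelated tokens.
print(TOKEN_PATTERN.findall(ingredient))
# With the join it stays a single feature.
print(TOKEN_PATTERN.findall(joined))
```

Ingredients containing other separators, such as hyphens, would still be split by this pattern.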

In [5]:
def pred_test_data(test_data, vectorizer, classifier):
    ingredients = test_data["ingredients"].values
    test_data = make_array_of_strings(ingredients)
    X_test_data = vectorizer.transform(test_data)
    pred_cuisines = classifier.predict(X_test_data)
    return pred_cuisines

In [6]:
def print_submission(ids, predictions):
    print("id,cuisine")
    # each row is printed as "<id> <cuisine>", separated by a space
    for recipe_id, prediction in zip(ids, predictions):
        print(recipe_id, prediction)
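
The `id,cuisine` header suggests a comma-separated submission file, while the rows above are printed with a space between the two fields. A possible way to write the rows as real CSV is sketched below; `write_submission` is a hypothetical helper, not part of the notebook, and the two demo rows are taken from the output shown further down.

```python
import csv
import io

def write_submission(ids, predictions, out):
    """Write id,cuisine rows as CSV to a file-like object (hypothetical helper)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "cuisine"])
    writer.writerows(zip(ids, predictions))

# Demo on an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_submission([18009, 28583], ["irish", "southern_us"], buffer)
print(buffer.getvalue())
```

To produce a file instead, pass a handle opened with `open(path, "w", newline="")` as `out`.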

In [7]:
dataset = load_dataset("train.json/train.json")
data_train, class_train, data_test, class_test = split_dataset(dataset, 0.25)

In [8]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
X_train = vectorizer.fit_transform(data_train)
X_test = vectorizer.transform(data_test)

In [9]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest = random_forest.fit(X_train, class_train)
pred_forest = random_forest.predict(X_test)
score = metrics.accuracy_score(class_test, pred_forest)
print("RandomForest accuracy:   %0.3f" % score)
# an earlier run gave 0.710; results vary from run to run because no random_state is set

RandomForest accuracy:   0.713


In [10]:
kn_classifier = KNeighborsClassifier(n_neighbors=10)
kn_classifier = kn_classifier.fit(X_train,class_train)
pred_kn = kn_classifier.predict(X_test)
score = metrics.accuracy_score(class_test, pred_kn)
print("KNeighbors accuracy:   %0.3f" % score)

KNeighbors accuracy:   0.729


In [11]:
test_data = load_dataset("test.json/test.json")
pred_cuisines = pred_test_data(test_data,vectorizer,random_forest)
print_submission(test_data["id"].values, pred_cuisines)

id,cuisine
18009 irish
28583 southern_us
41580 italian
29752 cajun_creole
35687 italian
38527 southern_us
19666 spanish
41217 chinese
28753 mexican
22659 british
21749 italian
44967 italian
42969 indian
44883 italian
20827 southern_us
23196 mexican
35387 southern_us
33780 southern_us
19001 mexican
16526 southern_us
42455 japanese
47453 indian
42478 french
11885 indian
16585 italian
29639 southern_us
26245 thai
38516 korean
47520 italian
26212 southern_us
23696 mexican
14926 thai
13292 southern_us
27346 japanese
1384 chinese
15959 mexican
42297 russian
46235 indian
21181 indian
9809 cajun_creole
22982 cajun_creole
23120 chinese
26743 french
12741 mexican
7050 italian
22437 italian
14760 spanish
10649 indian
48065 vietnamese
39931 chinese
32935 italian
40142 thai
19103 indian
38267 italian
30512 italian
3324 chinese
26197 italian
384 japanese
47230 chinese
39748 filipino
41038 french
24878 mexican
42624 filipino
45706 korean
8874 mexican
23878 greek
47234 mexican
24075 thai
46323 italian

35326 indian
35922 mexican
28171 italian
1322 italian
14923 southern_us
42344 italian
16380 mexican
15233 indian
8338 chinese
41014 filipino
30440 mexican
23043 chinese
20788 southern_us
14375 southern_us
1184 chinese
34280 italian
46171 cajun_creole
31834 chinese
14759 mexican
15855 chinese
6662 italian
15056 greek
34873 chinese
20089 southern_us
20015 mexican
18402 italian
6418 french
1454 chinese
14070 chinese
26424 chinese
27349 indian
26459 vietnamese
13151 italian
12264 italian
18366 italian
46851 indian
18954 italian
39858 chinese
43500 cajun_creole
21656 italian
10505 chinese
5313 moroccan
28951 japanese
19961 indian
1795 italian
9917 french
26180 mexican
9726 italian
47689 italian
7866 mexican
41625 mexican
35160 mexican
4430 southern_us
39082 french
46186 indian
11988 indian
24177 french
31849 italian
41194 moroccan
36968 mexican
42330 chinese
36791 thai
19704 italian
4933 indian
4255 mexican
37105 italian
35848 mexican
17416 mexican
39066 italian
36270 cajun_creole
24406 fil

12756 mexican
32462 southern_us
42204 vietnamese
246 mexican
47061 korean
15908 thai
29572 italian
13458 mexican
14022 southern_us
40437 vietnamese
48136 mexican
17687 italian
48504 brazilian
22857 southern_us
11465 indian
42107 thai
12514 mexican
46339 indian
8383 southern_us
27362 irish
38646 thai
44084 korean
12946 thai
11670 indian
34916 italian
2944 indian
33063 mexican
26982 jamaican
2231 mexican
5118 indian
28258 italian
44113 southern_us
47328 chinese
33937 filipino
40399 italian
23562 vietnamese
12787 mexican
13992 japanese
38992 italian
45269 mexican
15211 moroccan
34602 southern_us
41779 vietnamese
4179 mexican
37925 japanese
46161 italian
5512 southern_us
27845 mexican
12024 italian
41072 irish
17911 italian
4237 chinese
40442 italian
15522 mexican
6804 southern_us
43807 thai
21647 chinese
22758 italian
32551 mexican
20470 mexican
20842 southern_us
43018 indian
45493 italian
42442 italian
14655 italian
32247 southern_us
32302 mexican
42075 mexican
19949 mexican
23352 southe

3835 italian
1960 indian
43991 southern_us
45136 italian
41554 southern_us
10118 japanese
43827 mexican
32583 mexican
31949 italian
23574 chinese
28211 chinese
21793 mexican
25914 mexican
34735 spanish
26071 chinese
9476 italian
29443 southern_us
16966 chinese
6457 southern_us
25882 mexican
48390 greek
28356 southern_us
15699 southern_us
2090 spanish
4686 italian
27937 thai
25560 mexican
38937 italian
48670 southern_us
2547 french
17442 southern_us
28669 cajun_creole
4410 mexican
36917 chinese
46755 indian
25321 southern_us
47019 chinese
33077 mexican
11593 french
44295 mexican
46632 chinese
18503 italian
29030 southern_us
32516 french
35128 southern_us
41715 greek
49185 indian
35517 italian
13909 southern_us
36886 southern_us
29304 mexican
29577 mexican
24570 mexican
29466 chinese
10724 indian
30026 italian
37133 southern_us
29346 thai
32871 thai
46420 chinese
29083 chinese
4402 italian
15877 italian
33448 indian
3541 mexican
27704 indian
28658 italian
23793 italian
11040 irish
28419 

43425 thai
43325 mexican
24159 italian
2121 chinese
15876 cajun_creole
11191 southern_us
4338 mexican
4072 southern_us
29363 french
29278 indian
4276 spanish
47413 french
16670 italian
31767 italian
37741 vietnamese
45568 greek
13871 southern_us
15015 indian
1536 thai
1466 thai
10405 mexican
15963 italian
45075 italian
2588 italian
17070 italian
14012 thai
44422 southern_us
14609 southern_us
46913 mexican
5710 french
16086 french
48335 southern_us
41667 italian
20551 italian
29812 french
32212 chinese
18 italian
14823 southern_us
30464 chinese
31748 greek
8081 mexican
2484 greek
41751 mexican
25020 italian
6889 mexican
37348 russian
27337 indian
21625 southern_us
43407 moroccan
737 southern_us
42668 french
36758 southern_us
3528 cajun_creole
40542 chinese
8277 chinese
22826 mexican
16937 italian
8216 italian
10738 brazilian
11932 chinese
21240 italian
14296 filipino
24128 italian
44718 southern_us
40317 southern_us
22385 italian
16417 southern_us
19098 southern_us
36846 mexican
10401 c

9407 japanese
4008 mexican
6472 mexican
43384 french
47289 chinese
27905 southern_us
9628 indian
41613 chinese
8128 indian
33224 mexican
21571 italian
41246 french
48240 mexican
46816 italian
45778 italian
13798 italian
31606 southern_us
43293 italian
27725 italian
7614 southern_us
24798 italian
16256 mexican
10296 mexican
46895 mexican
18706 italian
25895 chinese
42986 indian
42194 southern_us
44727 french
2381 mexican
7850 indian
21066 southern_us
1427 vietnamese
30832 french
44593 chinese
35604 italian
42802 italian
30307 greek
8893 japanese
43086 indian
278 southern_us
48861 italian
9641 indian
18801 french
13009 indian
42823 italian
3954 mexican
49035 italian
34278 thai
25941 southern_us
38577 southern_us
37973 southern_us
813 italian
29444 southern_us
30058 french
34601 southern_us
42370 italian
9228 southern_us
12638 southern_us
37549 french
19802 cajun_creole
30931 thai
22179 chinese
44664 mexican
42631 indian
23407 chinese
30557 vietnamese
28888 italian
326 thai
47891 mexican


40548 indian
22858 mexican
6224 greek
31972 italian
4190 southern_us
22066 southern_us
45809 mexican
16888 italian
41177 french
2585 filipino
34835 southern_us
27465 italian
32614 italian
23845 japanese
8345 southern_us
15313 chinese
36741 chinese
6845 french
24184 italian
10632 mexican
46732 italian
7328 mexican
37610 italian
39472 southern_us
26291 southern_us
30621 italian
5188 southern_us
29760 mexican
45337 southern_us
31663 french
3965 mexican
19981 italian
23268 southern_us
39421 thai
19286 indian
33493 chinese
18035 italian
19669 chinese
6073 southern_us
10457 southern_us
24605 indian
37873 mexican
3753 southern_us
8218 southern_us
28513 italian
40161 mexican
35777 italian
38159 chinese
20380 italian
20309 italian
22178 french
44534 southern_us
36812 italian
771 italian
14453 southern_us
37861 italian
40670 chinese
33687 italian
11720 chinese
41612 mexican
27089 indian
18732 italian
46522 italian
44217 thai
10202 italian
40090 mexican
9314 mexican
38428 southern_us
25439 cajun_

30697 italian
46439 chinese
38885 italian
20199 chinese
45830 french
10050 indian
20936 mexican
32765 italian
28668 french
45215 southern_us
6217 mexican
40777 mexican
46472 greek
2281 italian
38572 italian
6585 mexican
14684 mexican
9667 mexican
14907 cajun_creole
8693 italian
18101 italian
21028 indian
18313 southern_us
9982 filipino
6976 southern_us
40781 chinese
46741 moroccan
3444 italian
45802 italian
12432 southern_us
6856 french
23581 mexican
9211 french
34069 chinese
22324 jamaican
26983 brazilian
35515 italian
19409 cajun_creole
9731 indian
39758 southern_us
42449 thai
22877 chinese
23264 mexican
26064 italian
9606 italian
7595 korean
10261 cajun_creole
24981 chinese
47183 southern_us
12375 italian
23489 southern_us
3309 french
44269 southern_us
21212 italian
44846 italian
3875 italian
36756 indian
9086 southern_us
29308 mexican
19797 french
21599 italian
36287 brazilian
27373 chinese
12132 cajun_creole
24708 italian
22669 southern_us
46213 mexican
47599 moroccan
37858 indian

28289 thai
18355 southern_us
14703 greek
3461 italian
45098 cajun_creole
11405 jamaican
46844 mexican
26361 italian
6523 italian
24156 british
29448 southern_us
1001 mexican
17391 southern_us
12403 southern_us
25749 southern_us
21618 mexican
20383 chinese
12910 french
44894 italian
24454 french
5825 mexican
11 vietnamese
23267 southern_us
49099 italian
42191 indian
10319 greek
11648 greek
4412 korean
44603 japanese
12311 greek
9352 chinese
47952 italian
49425 italian
43678 mexican
10148 italian
16081 chinese
49688 italian
32985 mexican
11136 mexican
41207 southern_us
46821 indian
15891 italian
40458 chinese
11581 italian
45144 italian
47545 southern_us
21086 french
35770 mexican
35552 korean
21580 southern_us
191 indian
11292 brazilian
5941 mexican
25711 mexican
43796 indian
29280 mexican
9015 indian
28401 southern_us
5686 mexican
4547 mexican
35474 mexican
48123 mexican
43267 greek
1135 southern_us
37990 french
9701 french
47511 southern_us
19287 vietnamese
39299 mexican
24289 italian

Result of the solution check by Kaggle:  
submission1.csv  3 days ago  0 seconds  0 seconds   0.70917  

Conclusion: on the held-out 25% split, KNeighbors identifies the cuisine from the ingredients more accurately than RandomForest (0.729 versus 0.713). Accuracy could be improved further by trying another classifier, for example SVC, or another word-weighting scheme.
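
A minimal sketch of what the SVC variant could look like, assuming a linear kernel and default regularisation. The training strings are made up in the format produced by `make_array_of_strings`, so this only illustrates the wiring and says nothing about the accuracy SVC would reach on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training strings, one per recipe, with underscore-joined ingredients.
train_docs = [
    "garam_masala red_lentils onions tumeric",
    "naan garam_masala spinach tumeric",
    "soy_sauce ginger rice green_onions",
    "soy_sauce sesame_oil ginger noodles",
]
train_labels = ["indian", "indian", "chinese", "chinese"]

# TF-IDF features followed by a support vector classifier;
# the linear kernel is an assumption, the notebook does not specify one.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), SVC(kernel="linear"))
model.fit(train_docs, train_labels)

predictions = model.predict(["tumeric garam_masala potatoes", "ginger soy_sauce tofu"])
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training data only, the same separation the notebook maintains by calling `fit_transform` on the training set and `transform` on the test set.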
