# Pesticide Repository Classification

* The data that I choose as my main project is of the classification type.
* I will try to find out the product type

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.axes as ax
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.metrics import mean_squared_error
%matplotlib inline

ModuleNotFoundError: No module named 'matplotlib'

Reading from file and allowing the data to be also hebrew

In [None]:
df = pd.read_csv('data/Pesticide Repository.csv', sep=',', encoding = "ISO-8859-8")
df

I try to predict the "סוג תכשיר" column, or in english the "preparation type num" column. I will try to get all the other data that in the table and find out which "סוג תכשיר" he is: 'אינם קוטלי עשבים' ,'קוטלי עשבים'  or 'אורגני'

In [None]:
df.info()

In [None]:
df.describe()

# Transform The Columns to numeric

First I will clean the database so it will be only with numbers and without words using dictionris

In [None]:
def str_array_to_dictionry(my_list):
    my_dict = dict() 
    for index,value in enumerate(my_list):
        my_dict[value] = index
    return my_dict

In [None]:
def add_new_column(old_name, new_name):
    value = str_array_to_dictionry(pd.unique(df[old_name]))
    print(value)
    new_column = []
    for val in df[old_name]:
        new_column.append(value[val])
    df[new_name] = new_column

In [None]:
add_new_column("סוג תכשיר", 'preparation type num')

In [None]:
add_new_column("סוג פעילות אנגלי", 'activity type num')

In [None]:
add_new_column("סוג קרקע", 'field type num')

In [None]:
add_new_column("חומר פעיל", 'active ingredient num')

In [None]:
add_new_column('פורמולציה אנגלי', 'formulation num')

In [None]:
add_new_column('רעילות אנגלי', 'toxicity num')

In [None]:
add_new_column('דרגת רעילות אנגלי','degree of toxicity num')

In [None]:
add_new_column('קבוצת גידולים אנגלי', 'crop group num')

In [None]:
add_new_column('קבוצת נגעים אנגלי', 'lesion group num')

In [None]:
new_column = []
for index,value in enumerate(df['ריכוש חומר פעיל אנגלי']):
    try:
        if value.split(" ")[1] == "%":
            if value.split(" ")[0]:
                value = float(value.split(" ")[0]) * 10
            else:
                value = 1
        else:
            value = value.split(" ")[0]
    except:
        value=0
    new_column.append(value)
df['active substance concentration num'] = new_column

Changing column name to english

In [None]:
df['license number'] = df['מספר רשיון']
df['Preparation name'] = df['שם תכשיר אנגלי']

In [None]:
clean_data = df[['license number', 'Preparation name', 'preparation type num', 'activity type num', 'field type num', 'active ingredient num', 'formulation num', 'toxicity num', 'degree of toxicity num', 'crop group num', 'lesion group num', 'active substance concentration num']]

Clean zeros and blank values from 'active substance concentration num' column

In [None]:
clean_data = clean_data.drop(clean_data[(clean_data['active substance concentration num']==0 )| (clean_data['active substance concentration num']=='')].index)

In [None]:
clean_data = clean_data.drop_duplicates(subset='license number')
clean_data

In [None]:
# Description
print(clean_data.describe())


preparation type num distribution


In [None]:
print(clean_data.groupby('preparation type num').size())


In [None]:
clean_data.info

# Splitting The Data

I will split my data into two parametrs:
* **X** - The data that I give to the model
* **y** - The data that I want to predict

In [None]:
X = clean_data.drop(['license number', 'preparation type num', 'Preparation name'],axis=1)
y = clean_data['preparation type num']

After choosing the columns that I want from the choosen data, I will split the data into test and traind variables:/\

In [None]:
X.columns

In [None]:

X_test, X_train, y_test, y_train= train_test_split(X, y, test_size = 0.2, random_state=42)


In [None]:
print('y_test:   ', y_test.shape)
print('X_set:   ', X_test.shape)

# Analyze the Data by Comparing the Different columns

In [None]:
sns.countplot(clean_data['preparation type num'],label="Count")
plt.show()

# Creating Dummy Model

In [None]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_train, y_train)

# Finding Model

I will try to find the best model to analize my data:

* I will try to use the KNN model, and find the best number of nearest neighbors:

In [None]:
k_range = range(1,50)
scores = []
for k in k_range:
    knn_ = KNeighborsClassifier(n_neighbors=k)
    knn_.fit(X_train, y_train)
    y_pred = knn_.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

plt.figure(figsize=(12, 6))
plt.plot(k_range, scores,color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')

# Finding the maximum k - the number of nearest neighbors:
max_score = max(scores)
best_k = scores.index(max_score)
print("The best accuracy of the knn model is when k =",best_k, ", and the score is:",max_score) 

* I will try to use the SVM.SVC model with kernel='sigmoid':

In [None]:
sm = SVC(kernel='sigmoid')
sm.fit(X_train, y_train)
y_pred = sm.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("accuracy:   ", accuracy)

According to the models that I checked, the best accuracy scores is when I chosing knn model k = 5.

In [None]:
print("The accuracy score is:",max_score, ", or in another word:",(max_score*100),"%") 

# Error Model

In [None]:
mean_squared_error(y_test,sm.predict(X_test))

# Conclusion

* The accuracy score of the chosen model is 74.25%.

* I have seen that there is a connection between the components of the pesticide material, such as the amount of toxic substance, type of activity, type of soil, etc. and the type of preparation - non-herbicides, herbicides, organic.

# Part 2

In [None]:
# X_test, X_train, y_test, y_train= train_test_split(X, y, test_size = 0.2, random_state=42)

In [2]:
corr_matrix = X_train.corr()
corr_matrix["preparation type num"].sort_values(ascending=False)

NameError: name 'X_train' is not defined