
## **Loading the dataset**

In [1]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # for visualization
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")  # hide library warnings in the cell outputs
np.random.seed(42)  # seed NumPy's global random number generator

df = pd.read_csv("https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/West_Nile_Virus__WNV__Mosquito_Test_Results.csv")

In [3]:
df.head()

Unnamed: 0,SEASON YEAR,WEEK,TEST ID,BLOCK,TRAP,TRAP_TYPE,TEST DATE,NUMBER OF MOSQUITOES,RESULT,SPECIES,LATITUDE,LONGITUDE,LOCATION
0,2014,39,40542,100XX W OHARE AIRPORT,T902,GRAVID,09/25/2014 12:09:00 AM,8,negative,CULEX PIPIENS/RESTUANS,,,
1,2016,37,44219,100XX W OHARE AIRPORT,T902,GRAVID,09/15/2016 12:09:00 AM,39,negative,CULEX PIPIENS/RESTUANS,,,
2,2017,33,45351,100XX W OHARE AIRPORT,T905,GRAVID,08/17/2017 12:08:00 AM,50,positive,CULEX PIPIENS/RESTUANS,,,
3,2017,33,45345,100XX W OHARE AIRPORT,T900,GRAVID,08/17/2017 12:08:00 AM,17,positive,CULEX PIPIENS/RESTUANS,,,
4,2016,37,44169,4XX W 127TH,T135,GRAVID,09/15/2016 12:09:00 AM,12,negative,CULEX PIPIENS/RESTUANS,,,


## **Transforming the attribute to numeric**

In [4]:
# Encode each distinct trap name as an integer code (codes follow the sorted category order)
df['TRAP'] = df['TRAP'].astype('category')
df['TRAP'] = df['TRAP'].cat.codes

In [5]:
df.head()

Unnamed: 0,SEASON YEAR,WEEK,TEST ID,BLOCK,TRAP,TRAP_TYPE,TEST DATE,NUMBER OF MOSQUITOES,RESULT,SPECIES,LATITUDE,LONGITUDE,LOCATION
0,2014,39,40542,100XX W OHARE AIRPORT,171,GRAVID,09/25/2014 12:09:00 AM,8,negative,CULEX PIPIENS/RESTUANS,,,
1,2016,37,44219,100XX W OHARE AIRPORT,171,GRAVID,09/15/2016 12:09:00 AM,39,negative,CULEX PIPIENS/RESTUANS,,,
2,2017,33,45351,100XX W OHARE AIRPORT,174,GRAVID,08/17/2017 12:08:00 AM,50,positive,CULEX PIPIENS/RESTUANS,,,
3,2017,33,45345,100XX W OHARE AIRPORT,169,GRAVID,08/17/2017 12:08:00 AM,17,positive,CULEX PIPIENS/RESTUANS,,,
4,2016,37,44169,4XX W 127TH,106,GRAVID,09/15/2016 12:09:00 AM,12,negative,CULEX PIPIENS/RESTUANS,,,
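
The integer codes are only identifiers for the trap names; they carry no numeric or ordinal meaning. The cell above overwrites the original labels, so here is a minimal sketch of keeping the code-to-label mapping around. It assumes a small toy series (`traps`, built from the trap names in the first five rows) rather than the full dataset, and the names `trap_cat` and `code_to_label` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the TRAP column, using the trap names from the first five rows.
traps = pd.Series(["T902", "T902", "T905", "T900", "T135"])

# Keep the categorical around so the code -> label mapping is not lost.
trap_cat = traps.astype("category")
codes = trap_cat.cat.codes

# Categories are stored in sorted order; each code is the position of its label.
code_to_label = dict(enumerate(trap_cat.cat.categories))

print(codes.tolist())
print(codes.map(code_to_label).tolist())
```

On the full dataset the codes differ (for example T902 becomes 171), because there are many more distinct traps than in this toy series.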


## **Standardizing the TRAP attribute**

In [6]:
unscaled_features = df[['TRAP']]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Calculate μ & σ(fit) and apply the transformation(transform)
unscaled_features_array = sc.fit_transform(unscaled_features.values)

# Assign the scaled data to a DataFrame & use the index and columns arguments to keep your original indices and column names:
scaled_features = pd.DataFrame(unscaled_features_array, index=unscaled_features.index, columns=unscaled_features.columns)

scaled_features.head()

Unnamed: 0,TRAP
0,1.422334
1,1.422334
2,1.476493
3,1.386228
4,0.248881
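
As a sanity check on what `StandardScaler` computes, the sketch below compares it with the manual z-score `(x - mean) / std`, where the standard deviation is the population one (`ddof=0`). It uses only the five TRAP codes shown earlier as toy input, so the numbers will not match the output above, which was fitted on all 29,489 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy input: the five TRAP codes from df.head(), as a single-column 2D array.
codes = np.array([[171.0], [171.0], [174.0], [169.0], [106.0]])

scaled = StandardScaler().fit_transform(codes)

# Manual z-score; np.std uses ddof=0 by default, as StandardScaler does.
manual = (codes - codes.mean(axis=0)) / codes.std(axis=0)

print(np.allclose(scaled, manual))
```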


## **Transforming the second attribute from numeric to xs -- s -- m -- l -- xl.**

In [7]:
df.WEEK.describe()

count    29489.000000
mean        31.073587
std          4.533390
min         20.000000
25%         28.000000
50%         31.000000
75%         35.000000
max         40.000000
Name: WEEK, dtype: float64

In [8]:
# Setting up our 5 buckets and their labels
buckets = [ 20, 28, 31, 35, 38, 40 ]
bucketlabels = ['xs', 's', 'm', 'l', 'xl']

# Using cut() to separate data into the buckets we have built
df['WEEK'] = pd.cut(df['WEEK'] , bins=buckets, labels=bucketlabels, include_lowest=True)

# WEEK is overwritten in place with the bucket labels; check the result
df.head()

Unnamed: 0,SEASON YEAR,WEEK,TEST ID,BLOCK,TRAP,TRAP_TYPE,TEST DATE,NUMBER OF MOSQUITOES,RESULT,SPECIES,LATITUDE,LONGITUDE,LOCATION
0,2014,xl,40542,100XX W OHARE AIRPORT,171,GRAVID,09/25/2014 12:09:00 AM,8,negative,CULEX PIPIENS/RESTUANS,,,
1,2016,l,44219,100XX W OHARE AIRPORT,171,GRAVID,09/15/2016 12:09:00 AM,39,negative,CULEX PIPIENS/RESTUANS,,,
2,2017,m,45351,100XX W OHARE AIRPORT,174,GRAVID,08/17/2017 12:08:00 AM,50,positive,CULEX PIPIENS/RESTUANS,,,
3,2017,m,45345,100XX W OHARE AIRPORT,169,GRAVID,08/17/2017 12:08:00 AM,17,positive,CULEX PIPIENS/RESTUANS,,,
4,2016,l,44169,4XX W 127TH,106,GRAVID,09/15/2016 12:09:00 AM,12,negative,CULEX PIPIENS/RESTUANS,,,
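
To see how the bin edges behave, here is a small sketch with the same `buckets` and `bucketlabels` applied to a handful of hand-picked week numbers (the `weeks` series is made up for illustration). Intervals are closed on the right, so week 28 still falls into `xs`, and `include_lowest=True` is what keeps week 20 from becoming `NaN`.

```python
import pandas as pd

buckets = [20, 28, 31, 35, 38, 40]
bucketlabels = ["xs", "s", "m", "l", "xl"]

# Hand-picked weeks that sit on or near the bin edges.
weeks = pd.Series([20, 28, 29, 33, 37, 39, 40])

binned = pd.cut(weeks, bins=buckets, labels=bucketlabels, include_lowest=True)
print(binned.tolist())

# Without include_lowest, the left edge (20) is excluded and week 20 is not binned.
without_lowest = pd.cut(weeks, bins=buckets, labels=bucketlabels)
print(without_lowest.isna().tolist())
```

Weeks 39, 37 and 33 land in `xl`, `l` and `m`, which agrees with the first rows of the output above.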


## **Finding all missing values in the third attribute.**

In [9]:
df['TEST ID']

0        40542
1        44219
2        45351
3        45345
4        44169
         ...  
29484    21734
29485    27885
29486    36525
29487    22451
29488    42348
Name: TEST ID, Length: 29489, dtype: int64

In [10]:
df['TEST ID'].dtype

dtype('int64')

In [11]:
df['TEST ID'].isnull()

0        False
1        False
2        False
3        False
4        False
         ...  
29484    False
29485    False
29486    False
29487    False
29488    False
Name: TEST ID, Length: 29489, dtype: bool

In [12]:
bool_series = pd.isnull(df["TEST ID"])
df[bool_series] 

Unnamed: 0,SEASON YEAR,WEEK,TEST ID,BLOCK,TRAP,TRAP_TYPE,TEST DATE,NUMBER OF MOSQUITOES,RESULT,SPECIES,LATITUDE,LONGITUDE,LOCATION


The code above finds no null (missing) values in the third attribute (TEST ID): filtering on the null mask returns an empty DataFrame.
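
A more compact check is to count the nulls instead of printing the whole boolean series. The sketch below runs on a small made-up frame (`toy`, with values borrowed from rows shown in this notebook), not on the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: TEST ID is complete, LATITUDE has gaps.
toy = pd.DataFrame({
    "TEST ID": [40542, 44219, 41952],
    "LATITUDE": [np.nan, np.nan, 41.771199],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for one column.
print(toy["TEST ID"].isnull().any())
```

Applied to `df`, `df.isnull().sum()` would report the missing count for every attribute at once.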

## **Five point summary**

In [13]:
df['SEASON YEAR'].describe()

count    29489.000000
mean      2012.502018
std          3.802700
min       2007.000000
25%       2009.000000
50%       2012.000000
75%       2016.000000
max       2019.000000
Name: SEASON YEAR, dtype: float64
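
`describe()` also reports count, mean and standard deviation. If only the five-point summary (min, Q1, median, Q3, max) is wanted, it can be pulled out with `quantile`. A minimal sketch on a toy series of five years (the values are chosen for illustration, and the name `five_point` is hypothetical):

```python
import pandas as pd

# Toy series; on the real data this would be df['SEASON YEAR'].
years = pd.Series([2007, 2009, 2012, 2016, 2019])

# The five-point summary is the 0%, 25%, 50%, 75% and 100% quantiles.
five_point = years.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
five_point.index = ["min", "Q1", "median", "Q3", "max"]
print(five_point)
```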

## **Splitting the dataset into 60% training set and 40% test set with sampling**

In [14]:
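# Draw a 40% random sample; with replace=True rows can repeat, and the other 60% is not kept as a separate training set here
sample = df.sample(frac=0.4, replace=True, random_state=1)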
sample.shape

(11796, 13)

In [15]:
sample

Unnamed: 0,SEASON YEAR,WEEK,TEST ID,BLOCK,TRAP,TRAP_TYPE,TEST DATE,NUMBER OF MOSQUITOES,RESULT,SPECIES,LATITUDE,LONGITUDE,LOCATION
235,2017,m,45565,100XX W OHARE,173,GRAVID,08/31/2017 12:08:00 AM,20,positive,CULEX PIPIENS/RESTUANS,,,
12172,2015,m,41952,67XX S KEDZIE AVE,57,GRAVID,08/26/2015 12:08:00 AM,6,negative,CULEX PIPIENS/RESTUANS,41.771199,-87.703107,"(41.77119858797388, -87.70310660774493)"
5192,2013,xs,35680,100XX W OHARE AIRPORT,185,GRAVID,06/14/2013 12:06:00 AM,34,negative,CULEX RESTUANS,,,
17289,2011,m,32245,51XX N MONT CLARE AVE,149,GRAVID,08/26/2011 12:08:00 AM,5,positive,CULEX PIPIENS/RESTUANS,41.974523,-87.804589,"(41.974522761157274, -87.80458946950488)"
10955,2011,s,31858,36XX N PITTSBURGH AVE,13,GRAVID,08/05/2011 12:08:00 AM,18,negative,CULEX PIPIENS/RESTUANS,41.945961,-87.832942,"(41.94596109447193, -87.83294247349616)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23763,2007,s,21758,35XX W 116TH ST,127,GRAVID,08/07/2007 12:08:00 AM,29,negative,CULEX PIPIENS/RESTUANS,41.682180,-87.710092,"(41.6821804028987, -87.71009165099498)"
11353,2014,m,39665,10XX E 67TH ST,64,GRAVID,08/21/2014 12:08:00 AM,50,negative,CULEX PIPIENS/RESTUANS,41.773085,-87.600168,"(41.773085401492715, -87.60016755939222)"
2108,2019,m,48966,100XX W OHARE AIRPORT,185,GRAVID,08/08/2019 12:08:00 AM,2,negative,CULEX PIPIENS,,,
9158,2008,xs,24284,70XX N MOSELLE AVE,10,GRAVID,07/01/2008 12:07:00 AM,4,negative,CULEX RESTUANS,42.007998,-87.778235,"(42.007997503125345, -87.77823496507851)"
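
If the goal is two disjoint parts (60% training, 40% test), one possible approach is to sample the test rows without replacement and drop them from the original frame to get the training rows. This is a sketch on a ten-row toy frame, not the method used in the cell above; `toy`, `test_set` and `train_set` are illustrative names.

```python
import pandas as pd

# Ten-row toy frame standing in for df.
toy = pd.DataFrame({
    "TEST ID": range(10),
    "RESULT": ["negative"] * 7 + ["positive"] * 3,
})

# 40% test rows, drawn without replacement so no row appears twice.
test_set = toy.sample(frac=0.4, replace=False, random_state=1)

# Everything that was not sampled becomes the 60% training set.
train_set = toy.drop(test_set.index)

print(train_set.shape, test_set.shape)
```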


## **Splitting the dataset into 75% training set and 25% test set with crossvalidation**

In [16]:
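x = df.iloc[:, :12]  # first 12 columns (all but LOCATION); note this still includes the target column
y = df['NUMBER OF MOSQUITOES']  # target
# Single random hold-out split (75% train, 25% test); this is not k-fold cross-validation
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25)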
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))

X_train shape: (22116, 12)
X_test shape: (7373, 12)


In [17]:
x

Unnamed: 0,SEASON YEAR,WEEK,TEST ID,BLOCK,TRAP,TRAP_TYPE,TEST DATE,NUMBER OF MOSQUITOES,RESULT,SPECIES,LATITUDE,LONGITUDE
0,2014,xl,40542,100XX W OHARE AIRPORT,171,GRAVID,09/25/2014 12:09:00 AM,8,negative,CULEX PIPIENS/RESTUANS,,
1,2016,l,44219,100XX W OHARE AIRPORT,171,GRAVID,09/15/2016 12:09:00 AM,39,negative,CULEX PIPIENS/RESTUANS,,
2,2017,m,45351,100XX W OHARE AIRPORT,174,GRAVID,08/17/2017 12:08:00 AM,50,positive,CULEX PIPIENS/RESTUANS,,
3,2017,m,45345,100XX W OHARE AIRPORT,169,GRAVID,08/17/2017 12:08:00 AM,17,positive,CULEX PIPIENS/RESTUANS,,
4,2016,l,44169,4XX W 127TH,106,GRAVID,09/15/2016 12:09:00 AM,12,negative,CULEX PIPIENS/RESTUANS,,
...,...,...,...,...,...,...,...,...,...,...,...,...
29484,2007,s,21734,22XX W 113TH ST,78,GRAVID,08/07/2007 12:08:00 AM,23,negative,CULEX PIPIENS/RESTUANS,41.688171,-87.678252
29485,2009,m,27885,58XX N WESTERN AVE,24,GRAVID,08/25/2009 12:08:00 AM,1,negative,CULEX PIPIENS/RESTUANS,41.987245,-87.689417
29486,2013,s,36525,109XX S COTTAGE GROVE AVE,96,GRAVID,07/25/2013 12:07:00 AM,6,negative,CULEX RESTUANS,41.695494,-87.609082
29487,2007,m,22451,73XX S CICERO AVE,58,GRAVID,08/21/2007 12:08:00 AM,6,negative,CULEX PIPIENS/RESTUANS,41.760082,-87.741607
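
For comparison, k-fold cross-validation with four folds produces four different 75%/25% splits, and every row is used for testing exactly once. The sketch below uses scikit-learn's `KFold` on a made-up 20-row array; the choice of `n_splits=4` and `random_state=42` is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up feature matrix with 20 rows and 2 columns.
X = np.arange(40).reshape(20, 2)

# Four folds: each fold holds out 25% of the rows and trains on the other 75%.
kf = KFold(n_splits=4, shuffle=True, random_state=42)

fold_sizes = []
tested_rows = []
for train_idx, test_idx in kf.split(X):
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested_rows.extend(test_idx.tolist())

print(fold_sizes)
```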


## **Picking the 3 most important attributes and making a correlation matrix**

In [18]:
df2 = pd.DataFrame(df, columns = ['SEASON YEAR', 'TEST ID', 'NUMBER OF MOSQUITOES'])

In [19]:
corr = df2.corr()
corr

Unnamed: 0,SEASON YEAR,TEST ID,NUMBER OF MOSQUITOES
SEASON YEAR,1.0,0.994706,-0.018555
TEST ID,0.994706,1.0,-0.022352
NUMBER OF MOSQUITOES,-0.018555,-0.022352,1.0
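
SEASON YEAR and TEST ID are almost perfectly correlated (0.994706), so the two attributes carry nearly the same information. A small sketch for listing such pairs automatically is shown below; it rebuilds the matrix from the values printed above, and the cut-off `threshold = 0.9` is an arbitrary assumption.

```python
import pandas as pd

cols = ["SEASON YEAR", "TEST ID", "NUMBER OF MOSQUITOES"]

# Correlation matrix copied from the output above.
corr = pd.DataFrame(
    [[1.0, 0.994706, -0.018555],
     [0.994706, 1.0, -0.022352],
     [-0.018555, -0.022352, 1.0]],
    index=cols, columns=cols,
)

threshold = 0.9  # arbitrary cut-off chosen for illustration

# Walk the upper triangle only, so each pair is reported once.
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```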


## **Fitting a Random Forest model**

In [20]:
df3= pd.DataFrame(df, columns = ['SEASON YEAR', 'TEST ID', 'NUMBER OF MOSQUITOES','RESULT'])

In [21]:
x = df3.iloc[:, :3]  # features: SEASON YEAR, TEST ID, NUMBER OF MOSQUITOES
y = df3['RESULT']  # class labels, e.g. 'positive' / 'negative'
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))

X_train shape: (22116, 3)
X_test shape: (7373, 3)


In [22]:
rf = RandomForestClassifier()
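rf.get_params  # without parentheses this shows the bound method; call rf.get_params() to list the hyperparameters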

<bound method BaseEstimator.get_params of RandomForestClassifier()>

In [23]:
rf.fit(X_train, y_train)

RandomForestClassifier()

In [24]:
y_pred = rf.predict(X_test)

In [25]:
accuracy_score(y_test, y_pred)

0.9076359690763597
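
Accuracy alone is hard to interpret when one class dominates: a model that always predicts the majority class can already score high. The notebook does not show the class balance of RESULT, so the sketch below uses made-up labels (9 negative, 1 positive) to illustrate how a majority-class baseline and a confusion matrix could be computed for comparison.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 9 negative, 1 positive.
y_true = pd.Series(["negative"] * 9 + ["positive"])

# Baseline "model": always predict the most frequent class.
majority_label = y_true.mode()[0]
y_majority = pd.Series([majority_label] * len(y_true))

baseline = accuracy_score(y_true, y_majority)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_majority, labels=["negative", "positive"])

print(baseline)
print(cm)
```

On the real data, the same comparison would use `y_test` as the true labels and `y_pred` next to the baseline predictions.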