### Objective:  

Develop a brief understanding of the dataset you will be working with: load the dataset and perform a basic inspection, for example how many features it has and how many unique labels it contains.

## 1. Imports

In [15]:
import numpy as np
import pandas as pd
import tensorflow as tf

It is difficult to reproduce the same results across runs of a machine learning model, even when we run the same script on the same training data.

This also makes it hard to tell whether a change in performance comes from an actual model or data modification, or merely from a new random sample.

We can tackle these sources of variation by keeping full visibility into the data, the model and its parameters, and the environment that led to a specific result. This level of reproducibility reduces unexpected variation across runs and helps us debug machine learning experiments.

Randomness appears in many places in machine learning, such as weight initialization and data shuffling. To achieve reproducible, deterministic results, we can explicitly set the random seed.

We set the same seed for each framework we use:

In [16]:
np.random.seed(123)
tf.random.set_seed(123)
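
To see what seeding buys us, the following minimal sketch draws random numbers twice with the same seed and checks that both draws match. It uses NumPy only; the seed value 123 mirrors the cell above, and the helper name draw_sample is hypothetical, introduced just for this illustration.

```python
import numpy as np


def draw_sample(seed, size=5):
    """Seed NumPy's global generator, then draw `size` uniform random numbers."""
    np.random.seed(seed)
    return np.random.rand(size)


first = draw_sample(123)
second = draw_sample(123)  # same seed, so the same sequence
other = draw_sample(456)   # different seed, so a different sequence

print(np.array_equal(first, second))  # do the two same-seed draws match?
print(np.array_equal(first, other))   # does a different seed match?
```

Python's built-in random module keeps its own generator, so code that relies on it needs a separate random.seed call.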

## 2. The Data

### 2.1 Load the data

In [17]:
df = pd.read_csv('Phishing.csv')

### 2.2 Explore the data

It's important to look at our data, to make sure we understand the format, how it's stored, what type of values it holds, etc. Even if we've read descriptions about our data, the actual data may not be what we expect.

In [18]:
df.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,port,HTTPS_token,Request_URL,URL_of_Anchor,Links_in_tags,SFH,Submitting_to_email,Abnormal_URL,Redirect,on_mouseover,RightClick,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,1,-1,1,0,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,1,-1,-1,0,0,-1,1,1,0,1,1,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1,1
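
Beyond looking at the first rows, it can help to check how each column is stored and whether any values are missing. The sketch below does this on a small in-memory stand-in rather than on Phishing.csv; the three rows are made up for illustration, and only the column names are borrowed from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in; the values are invented for illustration.
csv_text = """having_IP_Address,URL_Length,Result
-1,1,-1
1,0,1
1,,-1
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)        # storage type of each column
print(sample.isna().sum())  # number of missing values per column
```

In this stand-in, the empty URL_Length cell is read as NaN, which is why that column is stored as a float instead of an integer. The same two calls can be run on df.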


The dataset has a large number of features, so we transpose the first ten rows to make the output fit the screen.

In [19]:
df.head(10).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
having_IP_Address,-1,1,1,1,1,-1,1,1,1,1
URL_Length,1,1,0,0,0,0,0,0,0,1
Shortining_Service,1,1,1,1,-1,-1,-1,1,-1,-1
having_At_Symbol,1,1,1,1,1,1,1,1,1,1
double_slash_redirecting,-1,1,1,1,1,-1,1,1,1,1
Prefix_Suffix,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
having_Sub_Domain,-1,0,-1,-1,1,1,-1,-1,1,-1
SSLfinal_State,-1,1,-1,-1,1,1,-1,-1,1,1
Domain_registeration_length,-1,-1,-1,1,-1,-1,1,1,-1,-1
Favicon,1,1,1,1,1,1,1,1,1,1


### The exact dimensions of the dataset

In [21]:
df.shape

(11055, 31)

The ndim attribute returns the number of dimensions of a DataFrame or Series. It is always 2 for a DataFrame and 1 for a Series.

In [22]:
df.ndim

2
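
As a quick illustration of the difference, the sketch below builds a small hypothetical DataFrame (not the phishing data) and compares ndim and shape for the frame and for a single column selected from it.

```python
import pandas as pd

# Hypothetical two-column frame, used only to illustrate ndim.
frame = pd.DataFrame({"URL_Length": [1, 0, 0], "Result": [-1, 1, -1]})
column = frame["Result"]  # selecting one column yields a Series

print(frame.ndim, frame.shape)    # 2 (3, 2)
print(column.ndim, column.shape)  # 1 (3,)
```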

### Unique Labels in the dataset

The labels of the dataset are stored in the Result column. We can list the unique labels in that column.

In [32]:
unique_Labels = df.Result.unique()

In [34]:
unique_Labels

array([-1,  1])

In [33]:
unique_Labels.size

2

In [38]:
print("The Result column has ", unique_Labels.size, " unique values i.e.", unique_Labels)

The Result column has  2  unique values i.e. [-1  1]
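
Knowing which labels exist is a first step; it is often also useful to know how many samples carry each label. The sketch below shows one way to do that with value_counts on a short hypothetical label column. The five values are made up, so the counts say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical label column; the real counts would come from df["Result"].
labels = pd.Series([-1, 1, 1, -1, 1], name="Result")

counts = labels.value_counts()                # absolute count per label
shares = labels.value_counts(normalize=True)  # fraction of samples per label

print(labels.nunique())  # number of distinct labels
print(counts)
print(shares)
```

Running df["Result"].value_counts() on the loaded data gives the same kind of summary and shows whether the two classes are balanced.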


### Summary of dataset

In [47]:
(rows, cols) = df.shape
print("The phishing dataset has ", rows, " samples and ", cols, "columns")
print("The dataset has ", df.ndim, "dimensions")
print("The Result column has ", unique_Labels.size, " unique values i.e.", unique_Labels)

The phishing dataset has  11055  samples and  31 columns
The dataset has  2 dimensions
The Result column has  2  unique values i.e. [-1  1]

Of the 31 columns, 30 are input features and one (Result) holds the label.
