## Pandas Quick Overview

First we import the pandas module as *pd*, a widely used convention.
Then we create a data frame from a CSV file using the pandas **read_csv()** function.

In [27]:
import pandas as pd

data = pd.read_csv('../datasets/Phishing_Email.csv')

The **data.shape** attribute returns the dimensions of the data frame as a (rows, columns) tuple.  In this case it is 18650 rows x 3 columns.


In [39]:
print(data.shape)

(18650, 3)
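
Because **shape** is an ordinary tuple, it can be unpacked into separate row and column counts. The snippet below is a minimal sketch on a small made-up data frame (the name `toy` and its contents are illustrative, not part of the phishing dataset).

```python
import pandas as pd

# Hypothetical 4 x 2 data frame used only to illustrate .shape
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})

# .shape is a plain (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")

# len() on a data frame returns the row count, the same as shape[0]
print(len(toy) == n_rows)
```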


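**data.info()** prints basic information about the data frame: the index range, each column's name, non-null count and dtype, and the approximate memory usage.  Note that *Email Text* has 18634 non-null values out of 18650, so 16 entries are missing.  The method prints its summary directly and returns None, which is why wrapping it in print() adds a trailing "None" to the output.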


In [40]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18650 entries, 0 to 18649
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  18650 non-null  int64 
 1   Email Text  18634 non-null  object
 2   Email Type  18650 non-null  object
dtypes: int64(1), object(2)
memory usage: 437.2+ KB
None
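
The *Unnamed: 0* column in the output above is most likely a row index that was written out when the CSV was saved. One way to avoid carrying it around as a data column is to pass `index_col=0` to **read_csv()**. The sketch below uses a tiny in-memory CSV as a stand-in for the real file; the three rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file. The header
# starts with an empty name, which is what a saved index looks like.
csv_text = (
    ",Email Text,Email Type\n"
    "0,hello team,Safe Email\n"
    "1,click this link now,Phishing Email\n"
    "2,meeting at noon,Safe Email\n"
)

# Default behaviour: the unnamed first column becomes "Unnamed: 0"
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Whether this is appropriate depends on the file: it assumes the first column really is just a saved index and carries no other meaning.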


**data.head()** shows you the first 5 rows of data to give you a sense of what it looks like.  In this
case we have a labelled phishing email dataset in which the *Email Type* column holds the label for each email, such as "Safe Email" or "Phishing Email".


In [10]:
print(data.head())

   Unnamed: 0                                         Email Text  \
0           0  re : 6 . 1100 , disc : uniformitarianism , re ...   
1           1  the other side of * galicismos * * galicismo *...   
2           2  re : equistar deal tickets are you still avail...   
3           3  \nHello I am your hot lil horny toy.\n    I am...   
4           4  software at incredibly low prices ( 86 % lower...   

       Email Type  
0      Safe Email  
1      Safe Email  
2      Safe Email  
3  Phishing Email  
4  Phishing Email  
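
With a labelled dataset it is often useful to check how many rows carry each label. A minimal sketch using **value_counts()** follows; it runs on a small made-up data frame (`toy` is hypothetical), so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical five-row stand-in that reuses the dataset's column names
toy = pd.DataFrame({
    "Email Text": ["hi", "win a prize", "lunch?", "verify your account", "notes attached"],
    "Email Type": ["Safe Email", "Phishing Email", "Safe Email", "Phishing Email", "Safe Email"],
})

# Number of rows per label
counts = toy["Email Type"].value_counts()

# Same breakdown expressed as fractions of the total
shares = toy["Email Type"].value_counts(normalize=True)

print(counts)
print(shares)
```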


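**data.describe()** computes summary statistics for the data frame: the count of non-null values, mean, standard deviation, min, max, and the 25%, 50%, and 75% percentiles. By default it covers only numeric columns. Here the only numeric column is *Unnamed: 0*, which appears to be a saved row index, so these particular numbers say little about the emails themselves. On a dataset with real numeric features it is a good way to get an overall sense of the data.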

In [18]:
print(data.describe())

         Unnamed: 0
count  18650.000000
mean    9325.154477
std     5384.327293
min        0.000000
25%     4662.250000
50%     9325.500000
75%    13987.750000
max    18650.000000
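
Because **describe()** covers only numeric columns by default, the text columns are absent from the output above. Calling **describe()** on a single non-numeric column gives a different summary: count, number of unique values, the most frequent value, and its frequency. The sketch below uses a small made-up data frame (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical five-row stand-in for the label column
toy = pd.DataFrame({
    "Email Type": ["Safe Email", "Phishing Email", "Safe Email", "Phishing Email", "Safe Email"],
})

# For a text column, describe() reports count, unique, top, and freq
label_summary = toy["Email Type"].describe()
print(label_summary)
```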


* **data.isnull()** returns a data frame of the same shape with True wherever a value is null.
* chaining **sum()** adds up those True values per column, giving a count of null values in each column.
* **data.dropna()** returns a new data frame without the rows that contain a null value; the original *data* is left unchanged.


In [32]:
print(data.isnull().sum())
print('>>>>>>>>>>>>>')
clean_data = data.dropna()
print(clean_data.info())

Unnamed: 0     0
Email Text    16
Email Type     0
dtype: int64
>>>>>>>>>>>>>
<class 'pandas.core.frame.DataFrame'>
Index: 18634 entries, 0 to 18649
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  18634 non-null  int64 
 1   Email Text  18634 non-null  object
 2   Email Type  18634 non-null  object
dtypes: int64(1), object(2)
memory usage: 582.3+ KB
None
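
Two details of the cleanup above are worth a short sketch. First, **dropna()** drops a row if any column is null; the `subset` argument restricts the check to chosen columns. Second, the surviving rows keep their original index labels (note "0 to 18649" in the output above), and **reset_index(drop=True)** renumbers them. The data frame below is made up for illustration (`toy` and the other variable names are hypothetical).

```python
import pandas as pd

# Hypothetical four-row stand-in with nulls in both columns
toy = pd.DataFrame({
    "Email Text": ["hi", None, "lunch?", None],
    "Email Type": ["Safe Email", "Phishing Email", None, "Safe Email"],
})

# Count rows that contain at least one null (rather than nulls per column)
rows_with_null = toy.isnull().any(axis=1).sum()

# Drop rows only when the email text itself is missing
kept = toy.dropna(subset=["Email Text"])

# The original labels survive the drop; renumber them from 0
renumbered = kept.reset_index(drop=True)

print(rows_with_null)
print(list(kept.index))
print(list(renumbered.index))
```

Restricting the check with `subset` is a design choice: it keeps rows whose text is present even if another column is null, which may or may not be what a given analysis needs.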
