# Module 6 #
### Expected Time to Complete: 6hrs <tba> ###
### Total Marks: 18 marks <tba> ###

## About ##

This Jupyter Notebook contains all of the portfolio assessment tasks for Module 6. The tasks start with simple data I/O and gradually build up to more complicated data access tasks that deal with different formats and different platforms. The focus is on building the skills required for accessing data, from very simple data I/O to more complex data access. Such skills are a critical but often overlooked part of data analysis. We will also work with data from our case study, 'Credit Card Fraud Detection', in this notebook. Cyber Analytics is the broader theme of the course, so we try to give you as many opportunities as possible to work with that type of data.

Please ensure you refer to the marking rubric for this assessment item while completing the following assessment tasks.

## Section 1 - Data Access Basics ##

### Section 1.1 - Data Access Fundamentals ###

Following along with Mark Quinn's (2017) YouTube video on basic data I/O, write a block of code that takes a 7-digit user security code plus a username and displays both on the screen. Also include a validity check on the security code.
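One possible shape for this task is sketched below. It assumes the "check" means confirming that the code is exactly seven digits, which the task does not spell out; the function names `is_valid_security_code` and `describe_user` are illustrative only. In the notebook the two values would normally come from `input()`; fixed values are used here so the cell runs without prompting.

```python
def is_valid_security_code(code: str) -> bool:
    """Return True when the code is exactly seven ASCII digits (assumed rule)."""
    return len(code) == 7 and code.isascii() and code.isdigit()


def describe_user(username: str, code: str) -> str:
    """Build the line to display, or an error message for a bad code."""
    if not is_valid_security_code(code):
        return f"Invalid security code for {username}: expected 7 digits."
    return f"User: {username}, security code: {code}"


# Hypothetical sample values standing in for input("Username: ") and so on.
print(describe_user("jdoe", "1234567"))
print(describe_user("jdoe", "12345"))
```

Keeping the code as a string rather than converting it to `int` preserves leading zeros, so `0012345` still counts as seven digits.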

### Section 1.2 - Formatting Output ###

Using the official Python docs tutorial as a reference, use formatted string literals (f-strings) to output a customer name and their credit card number, storing each value in an appropriately named variable.

Perform the same task as above, but with the str.format() method.
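The two prompts above could be approached along these lines. The variable names and the sample values are placeholders chosen for illustration, not data from the case study.

```python
customer_name = "Jane Doe"          # hypothetical sample value
card_number = "4111111111111111"    # placeholder, not a real card number

# Formatted string literal: expressions go directly inside the braces.
with_fstring = f"Customer: {customer_name}, card number: {card_number}"

# str.format(): the braces are filled from the arguments, in order.
with_format = "Customer: {}, card number: {}".format(customer_name, card_number)

print(with_fstring)
print(with_format)
```

Storing the card number as a string avoids accidental arithmetic and keeps any leading zeros intact.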

Using f-strings, create a table that has two columns (customer name, credit card number) and four rows of data. Output this in a way that ensures the columns will always line up (hint: set a minimum field width after the ':' in each replacement field).

Do the same as above, but using the str.format() method.
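A minimal sketch of both table prompts follows, assuming a field width of 20 characters per column is wide enough for the data; the rows are made-up placeholders.

```python
rows = [
    ("Jane Doe", "4111111111111111"),
    ("Baptiste", "5500000000000004"),
    ("Helena", "340000000000009"),
    ("Mary", "6011000000000004"),
]  # hypothetical placeholder data

# '<20' left-aligns in a field of at least 20 characters, '>20' right-aligns.
header = f"{'Customer name':<20}{'Credit card number':>20}"
fstring_lines = [f"{name:<20}{card:>20}" for name, card in rows]

# The same format specification works inside str.format() fields.
format_lines = ["{:<20}{:>20}".format(name, card) for name, card in rows]

print(header)
print("\n".join(fstring_lines))
```

The width is a minimum, not a maximum: a value longer than 20 characters would push its column out of line, so the width should be chosen from the longest expected value.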

Let's try this now with manual string formatting, using a combination of left and right justification for the customer names and credit card numbers.

And now the same thing using old-style string formatting (i.e. the % operator).
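For comparison, the manual and old-style variants might look like the sketch below. It reuses the assumed 20-character column width, and the rows are placeholders.

```python
rows = [
    ("Jane Doe", "4111111111111111"),
    ("Helena", "340000000000009"),
]  # hypothetical placeholder data

# Manual formatting: pad each value with str.ljust() and str.rjust().
manual_lines = [name.ljust(20) + card.rjust(20) for name, card in rows]

# Old-style formatting: '%-20s' left-justifies, '%20s' right-justifies.
percent_lines = ["%-20s%20s" % (name, card) for name, card in rows]

for line in manual_lines:
    print(line)
```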

## Section 2 - Accessing Data (Different Formats, Different Platforms) ##

### Section 2.1 - Reading and Writing Files ###

Following the official Python docs tutorial, create and open a file called customers in read and write mode.

Show how you would avoid the problems caused by loading an extremely large file into memory all at once.
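Two common ways to avoid holding a whole file in memory are sketched below: iterating over the file object line by line, and reading fixed-size chunks. The helper names and the temporary sample file are assumptions made so the example is self-contained.

```python
import tempfile
from pathlib import Path


def count_lines(path) -> int:
    """Count lines by iterating over the file object, one line at a time."""
    total = 0
    with open(path, encoding="utf-8") as handle:
        for _line in handle:  # the file object yields lines lazily
            total += 1
    return total


def read_in_chunks(path, chunk_size=1024):
    """Yield fixed-size pieces; useful when lines may be very long."""
    with open(path, encoding="utf-8") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


# A small stand-in file in a temporary directory (hypothetical content).
sample = Path(tempfile.mkdtemp()) / "customers"
sample.write_text("line one\nline two\nline three\n", encoding="utf-8")
print(count_lines(sample))
```

In both helpers only one line or one chunk is held at a time, so memory use stays roughly constant regardless of file size.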

Show two different methods for reading all of the lines from a file.
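The Python tutorial mentions `f.readlines()` and `list(f)` for this purpose. A short sketch of both, using a temporary stand-in file with made-up content:

```python
import tempfile
from pathlib import Path

# Hypothetical sample file so the example runs on its own.
sample = Path(tempfile.mkdtemp()) / "customers"
sample.write_text("Jane Doe\nBaptiste\nHelena\n", encoding="utf-8")

with open(sample, encoding="utf-8") as handle:
    lines_a = handle.readlines()  # method 1: readlines()

with open(sample, encoding="utf-8") as handle:
    lines_b = list(handle)        # method 2: list() over the file object

print(lines_a)
```

Both approaches keep the trailing newline on each line and load everything into memory, so they suit small files only.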

In the customers file you created previously write the following customer value to the file:
customer = ('Jane Doe', 47, 78965432)
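A sketch of creating the file and writing the tuple is shown below. It assumes mode `'w+'` is acceptable for "read and write mode" (it creates or truncates the file, whereas `'r+'` requires the file to exist already), and it places the file in a temporary directory purely to keep the example self-contained.

```python
import tempfile
from pathlib import Path

customer = ('Jane Doe', 47, 78965432)

# Assumption: a temporary directory stands in for the notebook's folder.
path = Path(tempfile.mkdtemp()) / "customers"

with open(path, "w+", encoding="utf-8") as handle:
    # write() only accepts strings, so the tuple is converted first.
    written = handle.write(str(customer))
    handle.seek(0)            # rewind before reading back
    contents = handle.read()

print(contents)
```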

Show how you would write the following dictionary to a file called 'security_customers'. Make sure you do so in a way that is easy, allows for interoperability and incorporates best practice.


customers = {50789: {'name': 'Baptiste', 'age': '52', 'status': 0},
             72658: {'name': 'Helena', 'age': '26', 'status': 1}}
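One reasonable reading of "easy, interoperable and best practice" is JSON via the standard `json` module together with a `with` block. The sketch below takes that reading; the temporary directory is an assumption to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path

customers = {50789: {'name': 'Baptiste', 'age': '52', 'status': 0},
             72658: {'name': 'Helena', 'age': '26', 'status': 1}}

# Assumption: a temporary directory stands in for the notebook's folder.
path = Path(tempfile.mkdtemp()) / "security_customers"

with open(path, "w", encoding="utf-8") as handle:
    json.dump(customers, handle, indent=2)

# Reading the file back shows what another program would see.
with open(path, encoding="utf-8") as handle:
    restored = json.load(handle)

print(restored)
```

JSON object keys are always strings, so the integer keys `50789` and `72658` come back as `"50789"` and `"72658"`.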

### Section 2.2 - Files and Exceptions ###

Append the following customers to your previously created file 'security_customers':

new_customers = {60793: {'name': 'Mary', 'age': '15', 'status': 1},
                 72658: {'name': 'Barry', 'age': '32', 'status': 1}}
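If the file was written as JSON (an assumption; the task does not mandate a format), opening it in `'a'` mode would produce two JSON documents back to back, which `json.load` cannot read. A sketch of a read, merge and rewrite approach follows. The helper name `append_customers` is hypothetical, and the file is first created in a temporary directory so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

customers = {50789: {'name': 'Baptiste', 'age': '52', 'status': 0},
             72658: {'name': 'Helena', 'age': '26', 'status': 1}}
new_customers = {60793: {'name': 'Mary', 'age': '15', 'status': 1},
                 72658: {'name': 'Barry', 'age': '32', 'status': 1}}


def append_customers(path, extra):
    """Merge extra records into the JSON file at path and return the result."""
    with open(path, encoding="utf-8") as handle:
        stored = json.load(handle)
    # Keys are stringified to match how JSON stored the original ones.
    stored.update({str(key): value for key, value in extra.items()})
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stored, handle, indent=2)
    return stored


path = Path(tempfile.mkdtemp()) / "security_customers"
path.write_text(json.dumps(customers), encoding="utf-8")
merged = append_customers(path, new_customers)
print(sorted(merged))
```

Note that key `72658` appears in both dictionaries, so with this approach the record for Barry replaces the record for Helena.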

Open the file 'security_customers' and read the entire file.

Write an exception block for the case where the file doesn't exist. Rather than letting a FileNotFoundError stop the program, allow it to fail silently.

Write a more appropriate exception block for the case where the file does not exist. In this case, provide a useful message to the user.
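The two exception-handling styles can be contrasted as in the sketch below. The function names and the wording of the message are illustrative, and the path deliberately points at a file that does not exist.

```python
import tempfile
from pathlib import Path


def read_silently(path):
    """Return the file contents, or None without any output if it is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        pass  # fail silently: the error is swallowed
    return None


def read_with_message(path):
    """Return the file contents, or explain the problem if the file is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        print(f"Could not find '{path}'. Check the file name and its location.")
        return None


# A path inside a fresh temporary directory, so the file cannot exist.
missing = Path(tempfile.mkdtemp()) / "security_customers"
silent_result = read_silently(missing)
message_result = read_with_message(missing)
```

Catching only `FileNotFoundError`, rather than a bare `except`, leaves unrelated problems such as permission errors visible.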

### Section 2.3 - Pandas IO tools and different data formats ###
There are no associated exercises for this section. It simply refers to the pandas IO tools documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). If you haven't done so already, take this opportunity to familiarise yourself with it, as it will prove useful in the following sections.

### Section 2.4 - Data Loading, Storage and File Formats ###

In this section we are going to start working with a data set that is specific to the cyber analytics context. This is the Kaggle Credit Card Fraud Detection data set, which can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud

Using McKinney (2017) as a guide, load the Kaggle credit card data into a pandas DataFrame. Make sure that you account for the column names appropriately (demonstrating that you have taken time to understand the data) as well as for potential missing values. As this is a CSV file, use the pandas.read_csv() reader to do so.
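A minimal sketch of the loading step is shown below. To stay runnable without the download, it reads a tiny in-memory stand-in with invented values and only a few of the columns; the real file has the columns `Time`, `V1` to `V28`, `Amount` and `Class`, and would be read by passing its path (assumed here to be `creditcard.csv`) instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical miniature of the data set: invented values, fewer columns.
sample_csv = io.StringIO(
    "Time,V1,V2,Amount,Class\n"
    "0,0.5,-0.2,10.0,0\n"
    "1,1.1,,2.5,0\n"
    "2,-0.7,0.3,NA,1\n"
)

df = pd.read_csv(
    sample_csv,
    header=0,              # the first row holds the column names
    na_values=["NA", ""],  # strings to treat as missing
)

print(df.dtypes)
print(df.isna().sum())     # missing values per column
```

Checking `df.isna().sum()` straight after loading is a quick way to confirm whether the real file contains any missing values at all.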

There are a number of different ways to find out how much memory a DataFrame consumes: sys.getsizeof(df), df.memory_usage() and df.info(). There is a Stack Overflow discussion on this here: https://stackoverflow.com/questions/18089667/how-to-estimate-how-much-memory-a-pandas-dataframe-will-need/47751572. Use one of these methods to find out how much memory the DataFrame from the previous step takes up.
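As an illustration of `df.memory_usage()`, the sketch below measures a small synthetic frame with invented values; applying the same two lines to the DataFrame loaded from the Kaggle file would give the figure the task asks for.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1000 rows, three 8-byte numeric columns.
df = pd.DataFrame({
    "Time": np.arange(1000, dtype="float64"),
    "Amount": np.ones(1000, dtype="float64"),
    "Class": np.zeros(1000, dtype="int64"),
})

# deep=True also counts the contents of object (string) columns.
per_column = df.memory_usage(deep=True)
total_bytes = int(per_column.sum())

print(per_column)
print(f"Total: {total_bytes / 1024:.1f} KiB")
```

`df.info(memory_usage="deep")` reports the same total in a human-readable summary.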

Provide commentary below on whether this particular file is considered 'large' and exactly what impact this has. Either demonstrate this with code or provide references to back up your assertion. Lee (2017) (https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c) is a reasonable reference point to start with here. However, you might find some better references yourself. 

Show the four ways that the official pandas docs suggest for avoiding excessive memory use when reading a large file (see https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html). Demonstrate these with the Kaggle data set.
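Three of the strategies discussed on that page (loading fewer columns, choosing smaller dtypes, and reading in chunks) can be sketched with `read_csv` alone, as below. The fourth, handing the work to another library, is not shown. The CSV text is an invented stand-in with only a few columns.

```python
import io

import pandas as pd

# Invented stand-in data: 10 rows, alternating Class values.
csv_text = "Time,V1,Amount,Class\n" + "".join(
    f"{i},0.5,{i * 1.5},{i % 2}\n" for i in range(10)
)

# 1) Load less data: read only the columns that are needed.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Amount", "Class"])

# 2) Use efficient dtypes: smaller numeric types where the range allows.
compact = pd.read_csv(
    io.StringIO(csv_text), dtype={"Class": "int8", "V1": "float32"}
)

# 3) Use chunking: process a few rows at a time and keep only a summary.
fraud_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    fraud_count += int(chunk["Class"].sum())

print(subset.shape, compact.dtypes["Class"], fraud_count)
```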

The pandas documentation introduced in Section 2.3 shows how to look at performance metrics for different IO methods (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#performance-considerations). Reproduce this here, but use the Kaggle data to do so.

Comment on what your top three performers are in terms of speed. How does this relate to the pandas documentation? If there are differences, comment on why you believe they occur.

Based on these tests, plus what you know about these methods, what would your recommended IO method be for this particular data set? Justify your recommendation. Remember that all IO methods have their advantages and disadvantages.

Write the Kaggle data set to a SQLite database and then show how you would load this data from the database into a pandas DataFrame.
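The round trip through SQLite can be sketched with the standard `sqlite3` module plus `DataFrame.to_sql` and `pandas.read_sql_query`. The frame below is a small invented stand-in, the table name `transactions` is hypothetical, and an in-memory database is used so no file is left behind; a file path would replace `":memory:"` in practice.

```python
import sqlite3

import pandas as pd

# Invented stand-in for the Kaggle data.
df = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [10.0, 2.5, 99.9],
    "Class": [0, 0, 1],
})

conn = sqlite3.connect(":memory:")
try:
    # index=False avoids storing the DataFrame index as an extra column.
    df.to_sql("transactions", conn, if_exists="replace", index=False)
    loaded = pd.read_sql_query("SELECT * FROM transactions", conn)
finally:
    conn.close()

print(loaded)
```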

## Section 3 - Creating Coherent Data Sets (PCAP mini-project) ##

In the following sections of this notebook you should address the following requirements:

1) Use scapy to explore and understand data in a pcap file (read the pcap file, explore a single item in the pcap, understand object types in scapy, and import layers)
2) Convert the data in your pcap file to a workable pandas DataFrame
3) Write your DataFrame to a CSV file (4 columns only: source address, destination address, source port and destination port)


For 1) and 2) above we recommend Ronald Eddings' blog post 'Learning Packet Analysis with Data Science' for its overview of pcap, Python and pandas (https://medium.com/hackervalleystudio/learning-packet-analysis-with-data-science-5356a3340d4e). However, following his tutorial may be impractical for some: the downloads are quite large, and sniffing packets on your own system may not be possible due to permissions (you must be root). Eddings' Jupyter Notebook on GitHub (https://github.com/secdevopsai/Packet-Analytics/blob/master/Packet-Analytics.ipynb) and the official Scapy docs (https://scapy.readthedocs.io/en/latest/introduction.html) provide enough information for you to explore and understand data in a pcap file. To make things a little easier we have provided you with a pcap file to use: EK_MALWARE_2014-08-01-Nuclear-EK-traffic_mailware-traffic-analysis.net.pcap. You can get this file from GitHub here (https://github.com/sandy-75/COIT20280). This pcap file was sourced from a bank of malicious and exploit pcaps from Contagio (http://contagiodump.blogspot.com/2013/04/collection-of-pcap-files-from-malware.html).

<EK_MALWARE_2014-08-01-Nuclear-EK-traffic_mailware-traffic-analysis.net.pcap: TCP:2080 UDP:227 ICMP:0 Other:0>

Ether / IP / TCP 178.154.131.216:http > 172.16.165.132:49474 A / Raw
IP / TCP 178.154.131.216:http > 172.16.165.132:49474 A / Raw
TCP 178.154.131.216:http > 172.16.165.132:49474 A / Raw
Raw
###[ Ethernet ]### 
  dst       = 00:0c:29:c5:b7:a1
  src       = 00:50:56:f3:ca:52
  type      = IPv4
###[ IP ]### 
     version   = 4
     ihl       = 5
     tos       = 0x0
     len       = 1500
     id        = 21614
     flags     = 
     frag      = 0
     ttl       = 128
     proto     = tcp
     chksum    = 0x58a6
     src       = 178.154.131.216
     dst       = 172.16.165.132
     \options   \
###[ TCP ]### 
        sport     = http
        dport     = 49474
        seq       = 482050258
        ack       = 1552734652
        dataofs   = 5
        reserved  = 0
        flags     = A
        window    = 64239
        chksum    = 0x30ca
        urgptr    = 0
        options   = []
###[ Raw ]### 
           load      = 'HTTP/1.1 200 OK\r\nServer: nginx/1.6.0\r\nDate: Fri, 01 Aug 2014 00:50:43

## Section 4 - Security Implications ##

## Section 5 - The Significance of CyberAnalytics ##