# Simulating the data
Before we start the exploratory data analysis, let's see how we simulated the data. Each month comes from its own run of `simulate.py`; for example, January 2018:

```
# example of how to run this simulation from jupyter (remove the ! to run from the command line)
!python simulate.py -s 0 --stealthy -l logs/jan_2018.csv -hl logs/hackers_jan_2018.csv 31 "2018-01-01" 0.01 0.5
```

We ran the simulation once per month with the following parameters:

| Month | Probability of attack in a given hour | Probability of trying entire userbase | Vary IP addresses? |
| --- | --- | --- | --- |
| Jan 2018 | 1.00% | 50% | Yes |
| Feb 2018 | 0.50% | 25% | Yes |
| Mar 2018 | 0.10% | 10% | Yes |
| Apr 2018 | 1.00% | 65% | Yes |
| May 2018 | 0.01% | 5% | Yes |
| Jun 2018 | 0.05% | 5% | Yes |
| Jul 2018 | 1.00% | 15% | Yes |
| Aug 2018 | 0.50% | 10% | Yes |
| Sep 2018 | 0.50% | 10% | No |
| Oct 2018 | 0.20% | 12% | No |
| Nov 2018 | 0.70% | 17% | Yes |
| Dec 2018 | 8.00% | 88% | Yes |
| Jan 2019 | 0.80% | 8% | Yes |
| Feb 2019 | 0.10% | 18% | Yes |
| Mar 2019 | 0.10% | 18% | Yes |
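
To get a feel for what these hourly probabilities mean, here is a rough back-of-the-envelope calculation. It assumes each hour is an independent trial with the listed probability; that is our simplification for illustration and not necessarily how `simulate.py` works internally. The function name `expected_attack_hours` is hypothetical.

```python
def expected_attack_hours(hourly_probability, days_in_month):
    """Expected number of hours in which an attack starts,
    assuming independent hourly trials (an illustrative assumption)."""
    return hourly_probability * 24 * days_in_month

jan_2018 = expected_attack_hours(0.01, 31)    # 1.00% per hour over 31 days
may_2018 = expected_attack_hours(0.0001, 31)  # 0.01% per hour over 31 days

print(f'Jan 2018: {jan_2018:.2f} expected hours with an attack start')
print(f'May 2018: {may_2018:.4f} expected hours with an attack start')
```

Under this assumption, January 2018 would see roughly 7.4 hours with an attack start, while May 2018 would most likely see none.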

We use pandas to combine the files by year. First, we create a utility function for concatenating the files:

```
import pandas as pd

def cat_csvs(format_string_file_pattern, index_col, month_list):
    """
    Utility function for concatenating CSV files from simulation.
    
    Parameters: 
        - format_string_file_pattern: The pattern for the file name with `{}` in the place of the month
        - index_col: The column with the datetimes to sort on.
        - month_list: The list of the months as formatted in the file names.
    
    Returns:
        A concatenated pandas DataFrame
    """
    return pd.concat([
        pd.read_csv(
            format_string_file_pattern.format(file), index_col=index_col, parse_dates=True
        ) for file in month_list
    ])
```
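
As a quick sanity check, we can try the function on two tiny CSV files written to a temporary directory. This is only a sketch: the file contents and the `username` column are made up for illustration, and the function body is repeated so the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd

def cat_csvs(format_string_file_pattern, index_col, month_list):
    # same logic as the utility above, repeated so this snippet is standalone
    return pd.concat([
        pd.read_csv(
            format_string_file_pattern.format(month), index_col=index_col, parse_dates=True
        ) for month in month_list
    ])

# hypothetical toy files standing in for the monthly simulation output
toy_files = {
    'jan': 'datetime,username\n2018-01-15 08:00:00,alice\n',
    'feb': 'datetime,username\n2018-02-03 12:30:00,bob\n',
}
tmp_dir = tempfile.mkdtemp()
for month, contents in toy_files.items():
    with open(os.path.join(tmp_dir, f'{month}_2018.csv'), 'w') as f:
        f.write(contents)

combined = cat_csvs(os.path.join(tmp_dir, '{}_2018.csv'), 'datetime', ['jan', 'feb'])
print(combined)
```

The rows come back in the order of `month_list`, with the `datetime` column parsed into a `DatetimeIndex`.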

Next, we concatenate the 2018 logs, making sure not to include any entries from early January 1, 2019, which the Poisson process may have generated at the end of the December 2018 simulation:
```
logs_2018 = cat_csvs(
    'logs/{}_2018.csv', 'datetime', 
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
)
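# keep only 2018 entries; sometimes the simulation overshoots the end date
# (use .loc here: selecting rows with logs_2018['2018'] no longer works in pandas 2.0+)
logs_2018.sort_index().loc['2018'].to_csv('logs/logs_2018.csv')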
```
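
To see what the year-based selection does, here is a small sketch on made-up data (the `toy_logs` frame and its `username` column are hypothetical). Selecting with a year string on a `DatetimeIndex` keeps only the rows that fall in that year, so the entry that overshoots into 2019 is separated out:

```python
import pandas as pd

# made-up log entries, deliberately out of order, with one overshoot into 2019
toy_logs = pd.DataFrame(
    {'username': ['carol', 'alice', 'bob']},
    index=pd.to_datetime([
        '2019-01-01 00:05:00', '2018-12-31 23:50:00', '2018-01-01 00:10:00'
    ]),
)
toy_logs.index.name = 'datetime'

sorted_logs = toy_logs.sort_index()
only_2018 = sorted_logs.loc['2018']   # rows that belong in the 2018 file
overshoot = sorted_logs.loc['2019']   # rows to carry over to the 2019 file

print(only_2018)
print(overshoot)
```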

Now, we concatenate the 2019 logs, remembering to add back the 2019 entries that ended up in the December 2018 simulation and to clip the April 2019 entries from the March simulation:
```
logs_2019 = pd.concat([
    cat_csvs('logs/{}_2019.csv', 'datetime', ['jan', 'feb', 'mar']), logs_2018.loc['2019']
])
# keep only Q1 2019; sometimes the simulation overshoots the end date
logs_2019.sort_index().loc['2019-Q1'].to_csv('logs/logs_2019.csv')
```
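
The same idea can be sketched on toy data. Here we use an explicit month range for the slice instead of a quarter string; this is just an alternative way to express the first quarter, and the helper `make_toy_logs` and its data are hypothetical:

```python
import pandas as pd

def make_toy_logs(timestamps):
    """Build a tiny stand-in for a simulated log file (hypothetical data)."""
    return pd.DataFrame(
        {'username': ['user'] * len(timestamps)},
        index=pd.DatetimeIndex(timestamps, name='datetime'),
    )

# Q1 2019 simulations, including one entry that overshoots into April
q1_simulations = make_toy_logs(['2019-01-10 09:00', '2019-03-31 23:00', '2019-04-01 00:20'])
# the entry that the December 2018 simulation pushed into 2019
december_overshoot = make_toy_logs(['2019-01-01 00:05'])

combined = pd.concat([q1_simulations, december_overshoot]).sort_index()
q1_only = combined.loc['2019-01':'2019-03']  # both endpoints are inclusive

print(q1_only)
```

Sorting before slicing matters here: a range slice over a `DatetimeIndex` is only reliable when the index is in order.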

Once we have the login attempt logs, we concatenate the 2018 hacker logs:
```
hackers_2018 = cat_csvs(
    'logs/hackers_{}_2018.csv', 'start', 
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
)
hackers_2018.sort_index().loc['2018'].to_csv('logs/hackers_2018.csv')
```

Concatenating the 2019 hacker logs is the same process:
```
hackers_2019 = pd.concat([
    cat_csvs('logs/hackers_{}_2019.csv', 'start', ['jan', 'feb', 'mar']), hackers_2018.loc['2019']
])
hackers_2019.sort_index().loc['2019-Q1'].to_csv('logs/hackers_2019.csv')
```

The process of building the CSV files from the individual simulations is contained in `merge_logs.py` and the entire process is in the bash script `run_simulations.sh`. You don't have to run either of these.

# Create SQLite Database

```
import sqlite3
import numpy as np
import pandas as pd

# read in files
logs_2018 = pd.read_csv('logs/logs_2018.csv', index_col='datetime')
logs_2019 = pd.read_csv('logs/logs_2019.csv', index_col='datetime')
hackers_2018 = pd.read_csv('logs/hackers_2018.csv', index_col='start')
hackers_2019 = pd.read_csv('logs/hackers_2019.csv', index_col='start')

# write to database
with sqlite3.connect('logs/logs.db') as conn:
    logs_2018.to_sql('logs', conn, if_exists='replace')
    logs_2019.to_sql('logs', conn, if_exists='append')
    hackers_2018.to_sql('attacks', conn, if_exists='replace')
    hackers_2019.to_sql('attacks', conn, if_exists='append')
```
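
To confirm that the replace-then-append pattern leaves both years in one table, we can read the data back with a query. The sketch below uses an in-memory database and made-up rows rather than the real `logs/logs.db`, so the values shown are purely illustrative:

```python
import sqlite3

import pandas as pd

# made-up stand-ins for the yearly log files
toy_2018 = pd.DataFrame(
    {'username': ['alice']}, index=pd.Index(['2018-06-01 10:00:00'], name='datetime')
)
toy_2019 = pd.DataFrame(
    {'username': ['bob']}, index=pd.Index(['2019-02-01 11:00:00'], name='datetime')
)

conn = sqlite3.connect(':memory:')  # in-memory database for the example
try:
    toy_2018.to_sql('logs', conn, if_exists='replace')  # creates the table
    toy_2019.to_sql('logs', conn, if_exists='append')   # adds rows to it
    row_count = pd.read_sql('SELECT COUNT(*) AS n FROM logs', conn)['n'].iloc[0]
    years = pd.read_sql(
        'SELECT DISTINCT substr(datetime, 1, 4) AS year FROM logs ORDER BY year', conn
    )['year'].tolist()
finally:
    conn.close()

print(row_count, years)
```

Note that using a `sqlite3` connection as a context manager (as in the cell above) commits the transaction but does not close the connection, which is why this sketch closes it explicitly.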