# Simulating the data
Before starting the exploratory data analysis, let's look at how the data was simulated:

```
# example of how to run this simulation from jupyter (remove the ! to run from the command line)
!python3 simulate.py -s 0 --stealthy -l logs/jan_2018.csv -hl logs/hackers_jan_2018.csv 31 "2018-01-01" 0.01 0.5
```

Each month was simulated on its own with a different set of parameters:

| Parameter | Jan 2018 | Feb 2018 | Mar 2018 | Apr 2018 | May 2018 | Jun 2018 | Jul 2018 | Aug 2018 | Sep 2018 | Oct 2018 | Nov 2018 | Dec 2018 | Jan 2019 | Feb 2019 | Mar 2019 |
| --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Probability of attack in a given hour | 1.00% | 0.50% | 0.10% | 1.00% | 0.01% | 0.05% | 1.00% | 0.50% | 0.50% | 0.20% | 0.70% | 1.00% | 0.80% | 0.20% | 1.00% |
| Probability of trying entire user base | 50% | 25% | 10% | 65% | 5% | 5% | 15% | 10% | 10% | 12% | 17% | 88% | 8% | 18% | 18% |
| Vary IP addresses? | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes |
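
To get a feel for what the hourly attack probabilities in the table mean, here is a rough back-of-the-envelope sketch. It assumes each hour is an independent draw with the listed probability, which is our reading of the table and not something confirmed by `simulate.py` (not shown here); the function name `expected_attack_hours` is made up for this illustration.

```python
import calendar

# Hourly attack probabilities copied from the table, written as fractions.
hourly_attack_prob = {(2018, 1): 0.01, (2018, 5): 0.0001, (2019, 2): 0.002}

def expected_attack_hours(year, month, p_hour):
    """Expected number of hours containing an attack, assuming independent hourly draws."""
    days_in_month = calendar.monthrange(year, month)[1]
    return p_hour * 24 * days_in_month

expected = {
    key: expected_attack_hours(*key, p) for key, p in hourly_attack_prob.items()
}
for (year, month), value in expected.items():
    print(f"{year}-{month:02d}: {value:.4f} expected attack hours")
```

Under that assumption, January 2018 (1.00% per hour over 744 hours) averages about 7.44 attack hours, while May 2018 (0.01%) averages well under one, so some months may contain no attacks at all.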

We use pandas to combine the files by year. First, we create a utility function for concatenating the files:

```python
import pandas as pd

def cat_csvs(format_string_file_pattern, index_col, month_list):
    """
    Utility function for concatenating CSV files from simulation.
    
    Parameters: 
        - format_string_file_pattern: The pattern for the file name with `{}` in the place of the month
        - index_col: The column with the datetimes to sort on.
        - month_list: The list of the months as formatted in the file names.
    
    Returns:
        A concatenated `pandas.DataFrame`
    """
    return pd.concat([
        pd.read_csv(
            format_string_file_pattern.format(file), index_col=index_col, parse_dates=True
        ) for file in month_list
    ]).sort_index()
```
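
As a quick check of how `cat_csvs()` behaves, the sketch below runs it on two tiny throwaway files. The file contents and the `username` column are invented for the demo, and the function is repeated so the cell runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

def cat_csvs(format_string_file_pattern, index_col, month_list):
    """Same logic as the utility above, repeated so this sketch is self-contained."""
    return pd.concat([
        pd.read_csv(
            format_string_file_pattern.format(month), index_col=index_col, parse_dates=True
        ) for month in month_list
    ]).sort_index()

with tempfile.TemporaryDirectory() as tmp:
    # Made-up monthly files; rows are deliberately out of order.
    Path(tmp, 'jan_2018.csv').write_text(
        'datetime,username\n2018-01-31 23:59:00,alice\n2018-01-02 08:00:00,bob\n'
    )
    Path(tmp, 'feb_2018.csv').write_text(
        'datetime,username\n2018-02-01 00:05:00,carol\n'
    )
    # Months are passed out of order too; sort_index() restores chronological order.
    demo = cat_csvs(str(Path(tmp, '{}_2018.csv')), 'datetime', ['feb', 'jan'])

print(demo)
```

Because the index is parsed as datetimes and sorted, the order of `month_list` and of the rows inside each file does not affect the result.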

We also write a second utility function to handle any spillover from one simulation to the next:
```python
def get_spillover(data, when):
    """Return the rows of `data` that fall in `when`, or an empty `pandas.DataFrame` if `when` is out of range."""
    try:
        return data.loc[when]
    except KeyError:
        return pd.DataFrame()
```
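
The sketch below illustrates both branches of `get_spillover()` on a made-up three-row log (the `username` values are invented). It relies on pandas partial string indexing, where a string like `'2019'` selects every row in that year, and on the `KeyError` pandas raises when the requested period lies entirely outside a sorted index.

```python
import pandas as pd

def get_spillover(data, when):
    """Same logic as above, repeated so the sketch is self-contained."""
    try:
        return data.loc[when]
    except KeyError:
        return pd.DataFrame()

# Made-up log: the last entry spills past the end of 2018.
with_spill = pd.DataFrame(
    {'username': ['alice', 'bob', 'carol']},
    index=pd.to_datetime(['2018-12-30 10:00', '2018-12-31 23:50', '2019-01-01 00:10']),
)
no_spill = with_spill.loc['2018']  # only the two 2018 rows

spill = get_spillover(with_spill, '2019')  # the single 2019 row
nothing = get_spillover(no_spill, '2019')  # KeyError caught, empty DataFrame returned

print(spill)
print(nothing.empty)
```

Returning an empty `DataFrame` means the caller can pass the result straight to `pd.concat()` without checking whether any spillover occurred.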

Next, we concatenate the 2018 logs, making sure not to record any entries from early on January 1, 2019, which the Poisson process in the December 2018 simulation may have generated:
```python
logs_2018 = cat_csvs(
    'logs/{}_2018.csv', 'datetime', 
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
)
logs_2018.loc['2018'].to_csv('logs/logs_2018.csv') # sometimes the simulation overshoots the end date
```

Now, we concatenate the 2019 logs, remembering to add back the 2019 entries that ended up in the December 2018 simulation and to clip any April 2019 entries from the March simulation:
```python
logs_2019 = pd.concat([
    cat_csvs('logs/{}_2019.csv', 'datetime', ['jan', 'feb', 'mar']), 
    get_spillover(logs_2018, '2019')
]).sort_index()
logs_2019.loc['2019-Q1'].to_csv('logs/logs_2019.csv') # sometimes the simulation overshoots the end date
```
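
The `'2019-Q1'` selection works because pandas partial string indexing also accepts quarter strings. As a small sketch on invented data, the quarter string keeps January through March and drops an entry that overshoots into April, matching an explicit month slice:

```python
import pandas as pd

# Made-up log where the final entry overshoots into April 2019.
logs = pd.DataFrame(
    {'username': ['alice', 'bob', 'carol']},
    index=pd.to_datetime(['2019-01-01 00:10', '2019-03-31 23:55', '2019-04-01 00:03']),
)

q1 = logs.loc['2019-Q1']                   # quarter string: January through March
explicit = logs.loc['2019-01':'2019-03']   # equivalent explicit slice, end inclusive

print(q1)
print(q1.equals(explicit))
```

The explicit slice is shown only as a cross-check; either form clips the April entry.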

Once we have the login attempt logs, we concatenate the 2018 hacker logs:
```python
hackers_2018 = cat_csvs(
    'logs/hackers_{}_2018.csv', 'start', 
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
)
hackers_2018.loc['2018'].to_csv('logs/hackers_2018.csv')
```

Concatenating the 2019 hacker logs follows the same process:
```python
hackers_2019 = pd.concat([
    cat_csvs('logs/hackers_{}_2019.csv', 'start', ['jan', 'feb', 'mar']),
    get_spillover(hackers_2018, '2019')
]).sort_index()
hackers_2019.loc['2019-Q1'].to_csv('logs/hackers_2019.csv')
```

The code for building the yearly CSV files from the individual simulations is in `merge_logs.py`, and the bash script `run_simulations.sh` runs the entire process from simulation to merged files. You don't have to run either of these.

# Create SQLite Database

```python
import sqlite3
import numpy as np
import pandas as pd

# read in files
logs_2018 = pd.read_csv('logs/logs_2018.csv', index_col='datetime')
logs_2019 = pd.read_csv('logs/logs_2019.csv', index_col='datetime')
hackers_2018 = pd.read_csv('logs/hackers_2018.csv', index_col='start')
hackers_2019 = pd.read_csv('logs/hackers_2019.csv', index_col='start')

# write to database
with sqlite3.connect('logs/logs.db') as conn:
    logs_2018.to_sql('logs', conn, if_exists='replace')
    logs_2019.to_sql('logs', conn, if_exists='append')
    hackers_2018.to_sql('attacks', conn, if_exists='replace')
    hackers_2019.to_sql('attacks', conn, if_exists='append')
```
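
To see what the `replace`-then-`append` pattern produces, here is a minimal sketch against an in-memory database with two made-up one-row frames standing in for the yearly logs. It also wraps the connection in `contextlib.closing()`, since using a `sqlite3` connection directly in a `with` statement commits the transaction but does not close the connection; whether that matters depends on how the database is used afterward.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Made-up stand-ins for logs_2018 and logs_2019.
part_1 = pd.DataFrame(
    {'username': ['alice']}, index=pd.Index(['2018-12-31 23:50:00'], name='datetime')
)
part_2 = pd.DataFrame(
    {'username': ['bob']}, index=pd.Index(['2019-01-01 00:10:00'], name='datetime')
)

with closing(sqlite3.connect(':memory:')) as conn:
    part_1.to_sql('logs', conn, if_exists='replace')  # (re)creates the table
    part_2.to_sql('logs', conn, if_exists='append')   # adds rows to the same table
    row_count = pd.read_sql('SELECT COUNT(*) AS n FROM logs', conn)['n'].iloc[0]
    columns = pd.read_sql('SELECT * FROM logs LIMIT 1', conn).columns.tolist()

print(row_count, columns)
```

Using `replace` for the first write makes the cell safe to rerun, because the table is rebuilt instead of accumulating duplicate rows. The index is stored as an ordinary column named after the index (`datetime` here).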
