# Cyber Security Breaches
In this project, I analyze common cyber security breaches in the US over the period 2007-2014.<br> You can find the dataset at [Kaggle](https://www.kaggle.com/alukosayoenoch/cyber-security-breaches-data).
#### By: Sumaya Altamimi

## Exploratory Analysis
Exploration starts with univariate visualizations to identify trends in distribution and outliers in single variables. Bivariate visualizations follow, to show relationships between variables in the data. Finally, multivariate visualization techniques are presented to identify complex relationships between three or more variables at the same time.

- Look for relationships
- Connect questions about the data
- Visualizations don't need to be perfect
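
As a minimal sketch of the outlier check that univariate exploration aims at, the snippet below applies the common 1.5 x IQR rule to a single numeric column. The toy values are made up for illustration and are not taken from the Kaggle dataset; only the column name `Individuals_Affected` comes from this notebook.

```python
import pandas as pd

# Toy values (hypothetical, not from the dataset)
toy = pd.DataFrame({'Individuals_Affected': [500, 600, 700, 800, 900, 1000, 50000]})

col = toy['Individuals_Affected']

# Quartiles and interquartile range of the column
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = col[(col < lower_fence) | (col > upper_fence)]

print(outliers.tolist())
```

A heavily skewed column like this one is also a reason to consider a log scale when plotting, as some of the later cells do.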


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

In [None]:
df = pd.read_csv('../input/cyber-security-breaches-data/Cyber Security Breaches.csv')
cols = ['Type_of_Breach','Summary','State','Individuals_Affected','Name_of_Covered_Entity'
        ,'Location_of_Breached_Information','year','Date_Posted_or_Updated']

In [None]:
year_order = df['year'].value_counts().index
state_order = df['State'].value_counts().index
attack_order = df['Type_of_Breach'].value_counts().index

In [None]:
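df = df[cols].reset_index(drop=True)
# keep records from 2007 onwards
df = df.loc[df['year'] > 2006]
# keep the 20 states with the most reported breaches
df = df.loc[df['State'].isin(state_order[:20])]
# breaches = df.Type_of_Breach.value_counts()[:6].index
# keep the 6 most frequent breach types; .copy() avoids a SettingWithCopyWarning below
df = df.loc[df['Type_of_Breach'].isin(attack_order[:6])].copy()
df['Summary'] = df['Summary'].fillna('No Summary')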

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

# Explore

## Univariate

- Which year has the highest number of attacks?
- What is the most common cyber-attack?
- Which state reports the most cyber-attacks?
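
The three questions above can also be answered numerically before plotting. The sketch below uses a small made-up frame with the same column names as this notebook; `most_frequent` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Toy rows (hypothetical, not from the dataset)
toy = pd.DataFrame({
    'year': [2011, 2012, 2013, 2013, 2013],
    'Type_of_Breach': ['Hacking/IT Incident', 'Theft', 'Theft', 'Loss', 'Theft'],
    'State': ['NY', 'CA', 'CA', 'TX', 'CA'],
})


def most_frequent(frame, column):
    """Return the most frequent value in `column` and how often it occurs."""
    counts = frame[column].value_counts()
    return counts.idxmax(), int(counts.max())


# One (value, count) answer per column
answers = {column: most_frequent(toy, column) for column in toy.columns}
for column, (value, count) in answers.items():
    print(f'{column}: {value} ({count} rows)')
```

On the real data, the same helper could be called as `most_frequent(df, 'year')`, and the result should agree with the tallest bar in the corresponding count plot.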

In [None]:
base_color = sb.color_palette()[0]
year_order = df['year'].value_counts().index
state_order = df['State'].value_counts().index
attack_order = df['Type_of_Breach'].value_counts().index

In [None]:
plt.figure(figsize=(20,8))
sb.countplot(data=df,y='year',color=base_color,order=year_order)
plt.xticks(rotation=45);

### 2013 has the highest number of attacks

In [None]:
plt.figure(figsize=(10,6))
sb.countplot(data=df,y='Type_of_Breach',color=base_color,order=attack_order)
plt.xticks(rotation=15);

### Theft is the most common attack

In [None]:
plt.figure(figsize=(20,8))
sb.countplot(data=df,x='State',color=base_color,order=state_order)
plt.xticks(rotation=45);

### California is the state with the most reported cyber-attacks

## Bi-variate
- What is the relationship between the year and the number of individuals affected?
- What is the relationship between the year and the type of cyber-attack?

In [None]:
plt.figure(figsize=(20,8))
plt.scatter(data = df, y = 'Individuals_Affected', x = 'year', alpha=1/2)
plt.xticks(rotation=15);

In [None]:
plt.hist2d(data = df, y = 'Individuals_Affected', x = 'year',cmin=0.5,cmap = 'viridis_r')
plt.colorbar();

### There is no strong relationship, but it tends to be positive.
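
One way to put a number on a weak, monotonic relationship like this is a rank correlation. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, on made-up values rather than the real dataset, so the resulting figure says nothing about the actual breaches.

```python
import pandas as pd

# Toy values (hypothetical, not from the dataset)
toy = pd.DataFrame({
    'year': [2009, 2010, 2011, 2012, 2013],
    'Individuals_Affected': [500, 1200, 900, 3000, 8000],
})

# Spearman's rho is the Pearson correlation of the ranked values,
# which makes it robust to the extreme values seen in this kind of data
year_rank = toy['year'].rank()
affected_rank = toy['Individuals_Affected'].rank()
rho = year_rank.corr(affected_rank)

print(round(rho, 3))
```

A value near 0 would indicate no monotonic trend and a value near 1 a consistently increasing one; a rank-based measure is used here as an assumption, because a few very large breaches would dominate a plain Pearson correlation.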

In [None]:
sb.violinplot(data=df, y='Type_of_Breach', x='year', color=base_color, inner=None,
              order=attack_order)
plt.xticks(rotation=15);

In [None]:
sb.boxplot(data=df, y='Type_of_Breach', x='year', color=base_color,order=attack_order)
plt.xticks(rotation=15);

In [None]:
plt.figure(figsize=(20,8))
sb.countplot(data = df, x = 'year', hue = 'Type_of_Breach',color=base_color,order=year_order);

In [None]:
ct_counts = df.groupby(['Type_of_Breach', 'year']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index = 'Type_of_Breach', columns = 'year', values = 'count')
ct_counts.head()

In [None]:
sb.heatmap(ct_counts, annot = True, cmap = 'vlag_r', center = 0)
plt.xticks(rotation=15);

### It appears that the number of attacks increases with time in general.
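
As a hedged alternative to the `groupby` / `pivot` steps used for the heatmap, `pd.crosstab` builds the same kind of count table in one call, and its column sums give the number of attacks per year directly. The rows below are made up for illustration.

```python
import pandas as pd

# Toy rows (hypothetical, not from the dataset)
toy = pd.DataFrame({
    'Type_of_Breach': ['Theft', 'Theft', 'Theft', 'Loss',
                       'Theft', 'Theft', 'Loss', 'Hacking/IT Incident'],
    'year': [2010, 2011, 2011, 2011, 2012, 2012, 2012, 2012],
})

# Count table: one row per breach type, one column per year
ct = pd.crosstab(toy['Type_of_Breach'], toy['year'])

# Total number of attacks per year, summed over breach types
year_totals = ct.sum(axis=0)

print(ct)
print(year_totals.to_dict())
# True when the yearly totals never decrease
print(year_totals.is_monotonic_increasing)
```

Unlike the pivoted table, `crosstab` fills combinations that never occur with 0 instead of NaN, which changes how empty cells appear in an annotated heatmap.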

## Multivariate

Are year, attack type, and the number of individuals affected related?

In [None]:
plt.figure(figsize=(20,8))
ax = sb.barplot(data = df, x = 'year', y = 'Individuals_Affected', 
                hue = 'Type_of_Breach',palette='viridis',order=year_order)
# ax.legend(loc = 8, ncol = 3, framealpha = 1, title = 'Type_of_Breach')
plt.xticks(rotation=15);

In [None]:
g = sb.FacetGrid(data = df, col = 'Type_of_Breach', height = 4,
                     col_wrap = 3)
g.map(plt.scatter,'year','Individuals_Affected');

g.set_ylabels('Individuals_Affected')
g.set_xlabels('year')
g.set_titles('{col_name}');
# apply the log scale to every facet explicitly
g.set(yscale='log');

# Explain Phase

# US Cyber Security Breaches from 2007 to 2014
Main findings:
- **Theft** was the most frequent cyber-attack.
- **California** is the state with the most reported cyber-attacks.
- **2013** is the year with the most reported cyber-attacks.
- **2009-2013** is the period with a high number of attacks, especially **Theft**, as shown below.
- The number of individuals affected by cyber-attacks **increases** each year; **2014** is the exception.
- Two types of attack account for that increase: **Theft** and Hacking/IT incident.
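
The last finding can be checked with a plain aggregation. Note that `sb.barplot` plots the mean of `Individuals_Affected` per group by default, so totals and shares need a separate computation. The sketch below shows one way to do that on made-up rows; the numbers are illustrative only.

```python
import pandas as pd

# Toy rows (hypothetical, not from the dataset)
toy = pd.DataFrame({
    'year': [2012, 2012, 2013, 2013, 2013],
    'Type_of_Breach': ['Theft', 'Loss', 'Theft', 'Hacking/IT Incident', 'Loss'],
    'Individuals_Affected': [1000, 200, 3000, 1500, 500],
})

# Total individuals affected per year (rows) and breach type (columns)
totals = (toy.groupby(['year', 'Type_of_Breach'])['Individuals_Affected']
             .sum()
             .unstack(fill_value=0))

# Share of each breach type within its year
share = totals.div(totals.sum(axis=1), axis=0)

# Breach type with the largest total in each year
top_type = totals.idxmax(axis=1)

print(share.round(2))
print(top_type.to_dict())
```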

### **Theft** was the most frequent cyber-attack in the period 2007-2014 as shown below:

In [None]:
sb.countplot(data=df,y='Type_of_Breach',color=base_color,order=attack_order)
plt.title('Common Cyber-attack in US')
plt.xlabel('Frequency')
plt.ylabel('Attack Type');

### **California** is the state with the most reported cyber-attacks in the period 2007-2014, as shown below:

In [None]:
sb.countplot(data=df,x='State',color=base_color,order=state_order)
plt.xticks(rotation=45)
plt.title('Reported cyber-attack per state')
plt.xlabel('State')
plt.ylabel('Frequency');

### **2013** is the year with the most reported cyber-attacks in the period 2007-2014, as shown below:

In [None]:
sb.countplot(data=df,y='year',color=base_color,order=year_order)
plt.xticks(rotation=45)
plt.title('Reported cyber-attack per year')
plt.xlabel('Frequency')
plt.ylabel('Year');

### The number of individuals affected by cyber-attacks **increases** each year; **2014** is the exception, as shown below:

In [None]:
plt.hist2d(data = df, y = 'Individuals_Affected',x = 'year',cmin=0.1,cmap = 'vlag')
plt.colorbar(label='Frequency')
plt.title('Individuals Affected by cyber-attacks per year')
plt.xlabel('Year')
plt.ylabel('Individuals Affected')
plt.yscale('log');

### **2009-2013** is the period with a high number of attacks, especially **Theft**, as shown below:

In [None]:
ct_counts = df.groupby(['Type_of_Breach', 'year']).size()
ct_counts = ct_counts.reset_index(name='count')
ct_counts = ct_counts.pivot(index = 'Type_of_Breach', columns = 'year', values = 'count')
sb.heatmap(ct_counts, annot = True, cmap = 'vlag_r', 
           center = 0,cbar_kws = {'label' : 'Frequency'})
plt.xticks(rotation=15)
plt.title('Common cyber-attacks per year')
plt.xlabel('Year')
plt.ylabel('Cyber attacks');

### Interestingly, two types of attack account for that increase: **Theft** and Hacking/IT incident.

In [None]:
ax = sb.barplot(data = df, x = 'year', y = 'Individuals_Affected', 
                hue = 'Type_of_Breach',palette='viridis')
plt.xticks(rotation=15)
plt.legend(loc = 6, bbox_to_anchor = (1.0, 0.5)) # legend to right of figure
plt.title('Individuals affected per year and cyber-attacks')
plt.xlabel('Year')
plt.ylabel('Individuals Affected');