Draw together some large-scale data from open sources and use that to start to make an argument about what the emerging technological security threats are, and where we should locate an office for investigating these. This is a question of topic and of place. 

Obviously, at this stage we expect a thorough job from you. Minimally, we want something that draws on a single large repository of data (numerical, textual, network, geocoded, etc.). But since the greatest insights often come from multiple data sources, if you can join these together and find relevant connections and use this preliminary data analysis to form some solid hunches for more pointed data collection and analysis, that would be the kind of work we would expect from a professional in the field.


Topic: Emerging technological security threats
Place: Physical location where Agent Smith should locate an office for investigating emerging technological security threats.

1. A set of research questions. These may be unfinished at this stage, but should be questions that are narrow enough that they can be answered with data that is available, and precise enough that you can identify differences in data that will help to answer them.


Research questions:
* Are there any geographic regions or industries that are consistently identified as being at higher risk for cybersecurity threats on r/cybersecurity, and if so, why?

* Are there any geographic trends or patterns in the types of cybersecurity incidents that are reported on r/cybersecurity, such as data breaches, malware attacks, or phishing campaigns?

* What are the most commonly discussed cybersecurity threats on r/cybersecurity, and do they vary by region or industry?

Subreddits to search:

r/cybersecurity
r/netsec

In [33]:
# !pip install psaw
# !pip install praw
# !pip install networkx



In [47]:
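import requests
import json
import pandas as pd
import praw                      # Reddit API wrapper
from psaw import PushshiftAPI    # Pushshift wrapper; only used by the commented-out code below
from datetime import datetime
import secret                    # local module holding the Reddit credentials (app_id, app_secret, uname, upass)

import datetime as dt            # library for date management
import matplotlib.pyplot as plt  # library for plotting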




reddit = praw.Reddit(user_agent="ASU STC-510 Data Wrangling Basics (by Trioptre)",
                     client_id=secret.app_id,
                     client_secret=secret.app_secret,
                     username=secret.uname,
                     password=secret.upass)

# PushshiftAPI was giving me two errors:

# /Users/david/opt/anaconda3/lib/python3.9/site-packages/psaw/PushshiftAPI.py:192: UserWarning: Got non 200 code 404
#   warnings.warn("Got non 200 code %s" % response.status_code)
# /Users/david/opt/anaconda3/lib/python3.9/site-packages/psaw/PushshiftAPI.py:180: UserWarning: Unable to connect to pushshift.io. Retrying after backoff.
#   warnings.warn("Unable to connect to pushshift.io. Retrying after backoff.")

# api = PushshiftAPI(reddit)


# Commenting the following out due to issues with the PushshiftAPI
# early_subs = []
# search_args = {
#     'q': 'zero day',
#     'after': int(datetime(2022,2,1).timestamp()),
#     'before': int(datetime(2022,3,1).timestamp()),
#     'limit':1000,
# }

# early_subs += list(api.search_submissions(**search_args))


# early_subs = []
# for submission in reddit.subreddit('cybersecurity').search('zero day', time_filter='month', limit=1000):
#     early_subs.append(submission)

# topiclist = []
# for submission in early_subs:
#     topiclist.append(submission.id)
#     print(submission.title, submission.id)

# print(topiclist)


search_args = {
    'q': 'zero day',
    'time_filter': 'month',
    'limit': 1000,
}

early_subs = []
for submission in reddit.subreddit('cybersecurity').search(search_args['q'], time_filter=search_args['time_filter'], limit=search_args['limit']):
    early_subs.append(submission)


early_subs

len(early_subs)

[eachsub.title for eachsub in early_subs]


Search terms:
cyberespionage
state-sponsored attacks
international sanctions
malware
ransomware
phishing
data breach
zero-day vulnerability
Advanced Persistent Threat (APT)
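
A minimal sketch of how the search terms above could be run against both subreddits with PRAW. It assumes the authenticated `reddit` object created earlier. The function name `collect_submissions` and the record layout are my own illustrative choices, not part of PRAW. Results are keyed by submission id, so a post that matches several terms is stored only once.

```python
def collect_submissions(reddit, subreddits, terms, time_filter="month", limit=100):
    """Search every subreddit for every term and return {submission_id: record}.

    `reddit` is assumed to be an authenticated praw.Reddit instance (or any
    object exposing the same .subreddit(name).search(...) interface).
    """
    found = {}
    for sub_name in subreddits:
        subreddit = reddit.subreddit(sub_name)
        for term in terms:
            for submission in subreddit.search(term, time_filter=time_filter, limit=limit):
                # Create the record the first time this submission is seen.
                record = found.setdefault(submission.id, {
                    "id": submission.id,
                    "subreddit": sub_name,
                    "title": submission.title,
                    "matched_terms": [],
                })
                # Remember every search term that surfaced this submission.
                if term not in record["matched_terms"]:
                    record["matched_terms"].append(term)
    return found

# Hypothetical usage with the real client:
# results = collect_submissions(reddit, ["cybersecurity", "netsec"],
#                               ["malware", "ransomware", "phishing"])
```

Keeping `matched_terms` on each record makes it possible to count later how often two threat terms appear in the same post.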



What kind of data do I have? (Numerical, textual, network, geocoded, etc.)

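- JSON data from the Reddit API: textual (post titles and bodies) with numerical metadata (scores, comment counts, timestamps)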
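
PRAW hands the data back as `Submission` objects rather than raw JSON, so one way to make it tabular is to copy a few attributes into plain dictionaries. This is only a sketch: the choice of fields and the helper name `submission_to_record` are my assumptions, and the function works on any object that carries these attributes.

```python
from datetime import datetime, timezone

# Attributes copied from each submission (an illustrative selection).
FIELDS = ("id", "title", "selftext", "score", "num_comments", "created_utc")

def submission_to_record(submission):
    """Flatten one submission-like object into a plain dict."""
    record = {name: getattr(submission, name, None) for name in FIELDS}
    # str() of a PRAW Subreddit object gives its display name.
    record["subreddit"] = str(getattr(submission, "subreddit", ""))
    # created_utc is a Unix timestamp; convert it to an aware datetime.
    if record["created_utc"] is not None:
        record["created"] = datetime.fromtimestamp(record["created_utc"], tz=timezone.utc)
    else:
        record["created"] = None
    return record

# Hypothetical usage with the list built earlier:
# records = [submission_to_record(s) for s in early_subs]
# df = pd.DataFrame(records)
```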

Lu, Y., Zhu, W., & Liu, P. (2020). Deep learning for cybersecurity: A survey. IEEE Communications Surveys & Tutorials, 22(3), 1593-1633.

The article by Lu et al. provides a comprehensive survey of the use of deep learning techniques in cybersecurity. The authors review the current state-of-the-art in the application of deep learning to various cybersecurity tasks, such as intrusion detection and malware analysis. They also discuss the challenges and opportunities of using deep learning in cybersecurity and identify some future research directions in this area.

-------

Alhadidi, D., Al-Dhaheri, M., Alkhouri, S., & Aloul, F. (2019). A survey on ransomware: Threats, vulnerabilities, and prevention. Computers & Security, 83, 81-105.

The article by Alhadidi et al. provides a comprehensive survey of the ransomware threat, which is a type of malware that encrypts user data and demands a ransom for its decryption. The authors review the current state-of-the-art in ransomware attacks, such as distribution methods and encryption techniques. They also discuss the vulnerabilities that make systems susceptible to ransomware attacks and propose some prevention techniques that can help mitigate the impact of ransomware.

--------

Amin, R., Chen, J., & Ganesan, D. (2019). A survey of security and privacy issues in smart grid. IEEE Communications Surveys & Tutorials, 21(3), 2526-2553.

Amin et al. provide a comprehensive survey of the security and privacy issues in smart grid systems. They review the current state of the art in smart grid security, such as threats and vulnerabilities to the system's communication and control infrastructure. They also discuss the privacy challenges associated with the collection and dissemination of smart grid data and propose some security and privacy solutions that can help address these challenges.

2. While you need not write a literature review, you should make reference to at least three scholarly articles/books that help you to frame your work. Well-written articles provide a heuristic function: suggesting "known unknowns" worth investigating. Likewise, there is value in replicating investigations, particularly when moved to different contexts: does research on one social platform or domain apply to others? Articles that provide strong methods sections may serve as a template for your own work.






Kim, S., & Choi, Y. (2020). Security threats in the internet of things (IoT) era. Information Systems Frontiers, 22(2), 279-291. doi: 10.1007/s10796-019-09972-1

Kim and Choi's article discusses the security threats associated with the Internet of Things (IoT), a network of physical devices connected to the internet. The authors highlight the need for increased security measures in the design and implementation of IoT devices and systems, as well as the importance of educating users about the risks associated with IoT.

--------

Cheng, C., & Furnell, S. (2019). Mobile device security: A review of risks and challenges. Journal of Information Privacy and Security, 15(1), 1-17. doi: 10.1080/15536548.2018.1543955

Cheng and Furnell's article provides an overview of the security risks associated with mobile devices, including malware, data breaches, and social engineering attacks. The authors highlight the need for increased user awareness and education about mobile device security, as well as the importance of implementing effective security measures at the device and network levels.

--------

Souppaya, M., & Scarfone, K. (2012). Guidelines for securing wireless local area networks (WLANs). National Institute of Standards and Technology, Special Publication 800-153.

Souppaya and Scarfone's publication provides guidelines for securing wireless local area networks (WLANs), which are commonly used to provide wireless internet access in homes, businesses, and public spaces. The authors highlight the security risks associated with WLANs, including eavesdropping, unauthorized access, and denial-of-service attacks. The guidelines provide recommendations for securing WLANs at the network, device, and user levels.




3. Your data needs to be assembled. This may draw on existing data sources (e.g., US Census data), or may use data you have assembled yourself. In any case, the work should not merely replicate existing research on an existing collection. Make sure, if you are using data that is already largely prepared for you, that you are creating new insights in your analysis. The data you use should be large enough that your pilot project will produce indicative insights, and is scalable as required.

In [None]:
# r/cybersecurity 

4. Place the data in an appropriate format: in a networkx, numpy array, pandas/geopandas data structure, or other structure that is appropriate for your analysis. Clean the data, looking for missing data, outliers, problems with the ways in which it is presented, etc. Exclude cases or elements that are not important to your analysis.
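
A possible cleaning pass for this step, sketched with pandas. It assumes the submissions have already been loaded into a DataFrame with `id`, `title`, and `selftext` columns (my assumed column names). It drops duplicate posts, drops posts with no title, and blanks out the `[removed]` and `[deleted]` placeholders that Reddit leaves in the body of withdrawn posts.

```python
import pandas as pd

def clean_submissions(df):
    """Return a cleaned copy of a submissions DataFrame.

    Assumes columns named 'id', 'title' and 'selftext'.
    """
    # The same post can be returned by several searches; keep the first copy.
    cleaned = df.drop_duplicates(subset="id").copy()
    # Normalise titles and drop rows that have none.
    cleaned["title"] = cleaned["title"].fillna("").str.strip()
    cleaned = cleaned[cleaned["title"] != ""]
    # Treat missing or withdrawn bodies as empty text.
    cleaned["selftext"] = cleaned["selftext"].fillna("").replace(
        {"[removed]": "", "[deleted]": ""}
    )
    return cleaned.reset_index(drop=True)
```

Posts with an empty body are kept on purpose, because link posts carry their information in the title.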

5. Find ways of combining and viewing the data that will help you to address your research questions. This might include looking at KWICs (as I did above), plotting frequencies, binning and producing histograms, scatterplots, or exploring network structures. Again, your choice of approaches here is likely to be determined largely by your research questions. You are welcome to use the tools we have used in our lessons, but you may find there are approaches that fall outside those we've taken here that can help you to answer your questions.
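
Two of the views mentioned in this step, a keyword-in-context (KWIC) listing and simple term frequencies, can be sketched with the standard library alone. The function names, the tokenising pattern, and the default window width are illustrative assumptions. The input is any list of strings, such as the submission titles.

```python
import re
from collections import Counter

def kwic(texts, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Matches single-word keywords only; `width` is the number of words
    kept on each side.
    """
    hits = []
    target = keyword.lower()
    for text in texts:
        tokens = re.findall(r"[\w'-]+", text.lower())
        for i, token in enumerate(tokens):
            if token == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits

def term_counts(texts, terms):
    """Count how many texts mention each term (case-insensitive substring match)."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                counts[term] += 1
    return counts
```

The counts returned by `term_counts` could be passed straight to a bar chart, for example `plt.bar(counts.keys(), counts.values())`.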

6. Test any initial "hunches." Is there a causal connection between components? Are there scales that may reveal themselves to be connected in a regression? Does classifying cases get you somewhere useful? Is there explicit or implied metadata that can help you answer your questions?
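
One rough way to test the hunch that threat types vary by region is to count how often a place name and a threat term appear in the same title. The sketch below is only a first pass. The lists of regions and threats would be supplied by hand, the helper names are hypothetical, and matching place names as whole words will miss paraphrases and pick up false positives, so the counts are indicative at best.

```python
import re
from collections import Counter

def mentions(term, text):
    """True if `term` occurs in `text` as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def cooccurrence(titles, regions, threats):
    """Count titles in which a region and a threat term appear together."""
    table = Counter()
    for title in titles:
        found_regions = [r for r in regions if mentions(r, title)]
        found_threats = [t for t in threats if mentions(t, title)]
        for region in found_regions:
            for threat in found_threats:
                table[(region, threat)] += 1
    return table
```

If one region dominates a threat column in this table, that would be a hunch worth following up with more targeted data collection.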

7. Note what you have discovered, and whether it reveals a "hunch" that is solid enough to collect more data and prove up.