# Twitter Open Source Intelligence Pipeline 🐦🔎

The purpose of this project is to create a data pipeline for intel analysts looking <br>
to expedite the use of open source Twitter data. The specific use case presented <br>
here is collecting open source images of maritime vessels in an effort <br>
to verify the position of vessels based on images captured while they are <br>
close to shore.

In [None]:
This pipeline accomplishes:
    - Twitter data collection based on hashtag
    - Downloading of images related to collected tweets
    - Verification and classification of boats in images
    - Cleaning of the dataset to keep only tweets whose images contain boats

## Let's start by setting up the environment!

In [None]:
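import os
import subprocess
import sys

def install_req():
    os.chdir('..')  # requirements.txt sits one directory above the notebook
    path = os.path.join(os.getcwd(), 'requirements.txt')
    # Call pip through the running interpreter so packages install into this kernel
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-r', path], check=True)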

In [None]:
import os
install_req()

## Setup Continued, importing all the modules.

In [None]:
import sys
path = os.getcwd()
sys.path.append(path)  # Add the working directory to the import path
from twitter.twitter_scrape import *  # Scraping tool
from postgree.df_sql_utils import *  # Postgres utilities
from postgree.df_sql_utils import main as sql_main  # SQL main function
from twitter.twitter_scrape import main as tws_main  # Scraping main function
from aws_custom.aws_file_upload import main as aws_up_main  # AWS file uploader
from aws_custom.aws_config import *  # AWS config
from aws_custom.aws_model_utils import *  # Utilities for running inference

In [None]:
path = os.getcwd() 
path_down = path + '/download/'
path_up = path + '/upload/'

## Set Up Postgres Database

In [None]:
conn = conn  # Connection to the Postgres server (expected to come from the wildcard imports above)
create_db(conn)  # If the database isn't created, create one
create_table()  # If the table isn't created, create one

## Scrape Twitter for Tweets with Specific Hashtags

In [None]:
df = tws_main()

In the tws_main function we use the Tweepy Twitter API tool to query Twitter's <br>
recent tweets under the common accounts and hashtags that intel analysts use. <br>
More specifically, we search the hashtags 'Shipping', 'Shipsinpics', and 'Ship'. <br>
Each tweet carries a large amount of data, but the fields of interest are the id, text, <br>
tweet creation time, and the URL of the image related to the tweet. Additionally, <br>
this function automatically filters out tweets that do not <br>
contain images. Lastly, the tweets are packaged into a dataframe.
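
The filtering and packaging described above can be sketched roughly as follows. This is only an illustration and not the actual tws_main code: the function name `tweets_to_records` and the dictionary keys (`id`, `text`, `created_at`, `image_url`) are assumptions made for the example, and the tweets are plain dictionaries rather than Tweepy objects.

```python
from typing import Any, Dict, List

# Hypothetical field names; the real tws_main may use different ones.
FIELDS = ("id", "text", "created_at", "image_url")

def tweets_to_records(tweets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop tweets without an image and keep only the fields of interest."""
    records = []
    for tweet in tweets:
        if not tweet.get("image_url"):
            continue  # no image attached, so the tweet is filtered out
        records.append({field: tweet.get(field) for field in FIELDS})
    return records

# Made-up sample data for illustration only
sample = [
    {"id": 1, "text": "Tanker leaving port #Shipping", "created_at": "2021-05-01",
     "image_url": "https://example.com/a.jpg", "lang": "en"},
    {"id": 2, "text": "No picture here #Ship", "created_at": "2021-05-02"},
]
records = tweets_to_records(sample)
print([r["id"] for r in records])
```

Passing the resulting list to `pandas.DataFrame(records)` would give a dataframe with one row per remaining tweet.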

## Download the images associated with Tweets

In [None]:
download_image(df)

The free Twitter API does not provide functionality to download images related <br>
to tweets. This function takes care of that by downloading all the images in our tweet dataframe. These images are saved in the download folder, named with the id of their <br>
tweets. 
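
A minimal sketch of such a download step is shown here, using only the standard library. It is not the project's download_image function: the names `image_path` and `download_images`, the record keys, and the `.jpg` fallback are assumptions for illustration. The `fetch` parameter exists so the network call can be swapped out when trying the code offline.

```python
import os
import urllib.request
from urllib.parse import urlparse

def image_path(out_dir, tweet_id, url):
    """Build '<out_dir>/<tweet_id><ext>', falling back to .jpg if the URL has no extension."""
    ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    return os.path.join(out_dir, f"{tweet_id}{ext}")

def download_images(records, out_dir, fetch=urllib.request.urlretrieve):
    """Download each record's image and return the paths that were saved."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for rec in records:
        target = image_path(out_dir, rec["id"], rec["image_url"])
        try:
            fetch(rec["image_url"], target)
        except OSError as err:  # urllib's URLError and HTTPError derive from OSError
            print(f"Skipping tweet {rec['id']}: {err}")
            continue
        saved.append(target)
    return saved
```

Naming each file after its tweet id keeps the link between an image and its row in the dataframe, which the later filtering steps rely on.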

## Filter for tweets that only pertain to boats

In [None]:
image_files = list_images() # List path to all the images
delete_list = check_images(image_files, END_POINT)

To assess the validity of the collected tweets, we send the <br>
collected images to the boat classification model. If an image has a 'non boat' label score greater than 80%, it is marked for removal from the directory of images, along with its <br> associated tweet. <br> 
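
The thresholding rule can be illustrated with a small sketch. It assumes the model's response has already been parsed into a mapping from image path to label scores in the range 0 to 1; that response shape, the function name `flag_non_boats`, and the sample scores are all hypothetical, and the call to the actual endpoint is left out.

```python
from typing import Dict, List

NON_BOAT_LABEL = "non boat"  # label name taken from the text above
THRESHOLD = 0.80             # scores are assumed to lie in [0, 1]

def flag_non_boats(scores_by_image: Dict[str, Dict[str, float]],
                   threshold: float = THRESHOLD) -> List[str]:
    """Return paths whose 'non boat' score is strictly greater than the threshold."""
    return [
        path
        for path, scores in scores_by_image.items()
        if scores.get(NON_BOAT_LABEL, 0.0) > threshold
    ]

# Made-up scores for illustration only
predictions = {
    "download/101.jpg": {"boat": 0.95, "non boat": 0.05},
    "download/102.jpg": {"boat": 0.10, "non boat": 0.90},
    "download/103.jpg": {"boat": 0.20, "non boat": 0.80},  # exactly 80%: kept
}
flagged = flag_non_boats(predictions)
print(flagged)  # ['download/102.jpg']
```

Because the comparison is strict, an image scoring exactly 80% is kept; only images above the threshold are flagged.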


## Move all Images that contain boats, and create new data frame with only viable tweets


In [None]:
df_final = remove_from_df(df, delete_list)
delete_non_boats(delete_list)

In this step of the pipeline, the images and tweets that are unrelated to boats are removed. <br>
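
A rough sketch of this removal step follows. It relies on the fact that images are named after their tweet id, so a flagged file path identifies the tweet to drop. The helper names (`tweet_id_from_path`, `drop_flagged`, `delete_files`) are made up for this example and work on plain lists of dictionaries instead of the project's dataframe.

```python
import os
from typing import Any, Dict, List

def tweet_id_from_path(path: str) -> str:
    """'download/102.jpg' -> '102' (images are named after their tweet id)."""
    return os.path.splitext(os.path.basename(path))[0]

def drop_flagged(records: List[Dict[str, Any]], delete_list: List[str]) -> List[Dict[str, Any]]:
    """Keep only the tweets whose image was not flagged for deletion."""
    flagged_ids = {tweet_id_from_path(p) for p in delete_list}
    return [r for r in records if str(r["id"]) not in flagged_ids]

def delete_files(delete_list: List[str]) -> int:
    """Delete the flagged image files and return how many were removed."""
    removed = 0
    for path in delete_list:
        if os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed
```

With a pandas dataframe, the same filter could be written as `df[~df["id"].astype(str).isin(flagged_ids)]`, assuming the id column is named `id`.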

## Upload Images to S3 Bucket for later use

In [None]:
move_for_upload(path_down,path_up)
aws_up_main(path_up,'twitter-osint') 
delete_temp_folders(path_down, path_up)

Since the end goal of this pipeline is to feed the collected boat information into a GUI for analysts, <br> we upload all the images of boats to an AWS S3 bucket. This uses the AWS CLI <br> to properly connect to AWS. Lastly, in this step we clean up the temporary folders <br> that were used in the download and upload process.
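
One way the move, upload, and cleanup sequence could look is sketched here. This is not the project's aws_up_main: the helper names are hypothetical, and the upload is expressed as an `aws s3 cp --recursive` command, which assumes the AWS CLI is installed and credentials are already configured. The sketch only builds the command and leaves running it to the caller.

```python
import os
import shutil
from typing import List

def move_images(src_dir: str, dst_dir: str) -> List[str]:
    """Move every file from src_dir into dst_dir and return the moved file names."""
    os.makedirs(dst_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(dst_dir, name))
            moved.append(name)
    return moved

def s3_upload_command(local_dir: str, bucket: str) -> List[str]:
    """Build an 'aws s3 cp --recursive' command for the given folder and bucket."""
    return ["aws", "s3", "cp", local_dir, f"s3://{bucket}/", "--recursive"]

def remove_temp_folders(*folders: str) -> None:
    """Delete the temporary folders, ignoring ones that no longer exist."""
    for folder in folders:
        shutil.rmtree(folder, ignore_errors=True)
```

The command list can be executed with `subprocess.run(cmd, check=True)`, so a failed upload raises an error before the temporary folders are deleted.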


## Insert Tweets into a Postgres Database

In [None]:
sql_main(df_final)  # Insert only the tweets that passed the boat filter

Lastly, the tweets are formatted for insertion into a Postgres database, where they can be <br>
retrieved by tweet id.
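
The insert-and-retrieve pattern can be tried out with the sketch below. To keep it runnable without a database server, it swaps in Python's built-in `sqlite3` module in place of Postgres, so it is not the project's sql_main. The table layout, column names, and function names are assumptions for illustration.

```python
import sqlite3

def create_tweets_table(conn):
    """Create the tweets table if it does not exist yet (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, text TEXT, created_at TEXT, image_url TEXT)"
    )

def insert_tweets(conn, records):
    """Insert tweet records with a parameterized query."""
    rows = [(r["id"], r["text"], r["created_at"], r["image_url"]) for r in records]
    # INSERT OR IGNORE is SQLite syntax; it skips ids that are already stored
    conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def get_tweet(conn, tweet_id):
    """Fetch a single tweet row by its id, or None if it is not stored."""
    cur = conn.execute(
        "SELECT id, text, created_at, image_url FROM tweets WHERE id = ?", (tweet_id,)
    )
    return cur.fetchone()

demo_conn = sqlite3.connect(":memory:")
create_tweets_table(demo_conn)
insert_tweets(demo_conn, [
    {"id": 101, "text": "Tanker leaving port", "created_at": "2021-05-01",
     "image_url": "https://example.com/a.jpg"},
])
print(get_tweet(demo_conn, 101))
```

With Postgres and a driver such as psycopg2, the placeholders would be `%s` instead of `?`, and duplicate ids would be skipped with `ON CONFLICT (id) DO NOTHING`.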

## Lessons Learned

- APIs come with an extreme amount of overhead for setup and learning. Some of them can <br>
essentially be like learning a new language.
- If there is something that you want to do with Python, there is likely a package for it. <br>
- Although a workflow sounds incredibly simple at first, the problem is always more complex than you would expect. <br>
- Before starting to code, it is always helpful to start with a literature review of the major <br> tools that you want to use. I wish we had done this with Tweepy, the Python <br>
package for the Twitter API.
- The process of making code generalizable, so that it isn't overly dependent on being run <br> on your own machine, is a lot of work. 
