## Case Study
Twitter is a massive platform.  It has more than 300 million users, and it is a source of information on current events, social movements, and financial markets.  In a number of cases, information from Twitter has been shown to mobilize large numbers of individuals.  From #blacklivesmatter to other forms of *hashtag* activism, social media can play an important role in informing and mobilizing individuals.

This same activity can be extended to financial information.  The introduction of "cashtags" to Twitter has allowed individuals to connect and discuss stocks, but it has also given stock promoters a way to "pump and dump" low-value stocks.  Some researchers have analyzed the use of cashtags on Twitter.  We will use a similar method to look at the data, but we will ask a slightly different question.

### Reading
Hentschel M, Alonso O. 2014. Follow the money: A study of cashtags on Twitter. *First Monday*. URL: https://firstmonday.org/ojs/index.php/fm/article/view/5385/4109

#### Supplementary Information

* Evans, L., Owda, M., Crockett, K., & Vilas, A. F. (2019). A methodology for the resolution of cashtag collisions on Twitter–A natural language processing & data fusion approach. *Expert Systems with Applications*, **127**, 353-369.
* Evans, L., Owda, M., Crockett, K., & Vilas, A. F. (2021). [Credibility assessment of financial stock tweets](https://www.sciencedirect.com/science/article/pii/S0957417420310356). *Expert Systems with Applications*, **168**, 114351.
* Cresci, S., Lillo, F., Regoli, D., Tardelli, S., & Tesconi, M. (2019). Cashtag Piggybacking: Uncovering Spam and Bot Activity in Stock Microblogs on Twitter. *ACM Transactions on the Web (TWEB)*, **13(2)**, 11.

#### Raw Data source
I document the source of the ticker data below.  The tweet data we use here comes from the dataset used in Cresci *et al.* (2019), referenced above.  The data is available through Zenodo using the dataset's DOI: [10.5281/zenodo.2686861](https://doi.org/10.5281/zenodo.2686861)

### Formulating the question

The question we want to ask specifically is whether *cashtag frequency is tied to increases in stock price*.

To do this we need to know a few things.  First, we need to understand the frequency of cashtags, and classify them in some way.  What aspects of a cashtag are important?  What elements of a tweet containing a cashtag are important?  How do we go from raw cashtag to something we can analyze?

In addition, what other information do we need to help us understand our data?  How do we know stock prices?
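
One possible way to go from raw tweet text to something countable is to pull the cashtags out with a regular expression and count each symbol once per tweet. The sketch below is only illustrative: the pattern (a `$` followed by one to five letters, with an optional class suffix such as `.B`) is an assumption, not Twitter's official cashtag rule, and `sample_tweets`, `extract_cashtags` and `cashtag_counts` are hypothetical names.

```python
import re
from collections import Counter

# Assumed pattern: "$" followed by 1-5 letters and an optional ".X" class suffix.
# The lookbehind rejects a "$" that is glued to a preceding word or another "$".
CASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Za-z]{1,5}(?:\.[A-Za-z]{1,2})?)\b")


def extract_cashtags(text):
    """Return the upper-cased ticker symbols mentioned in one tweet."""
    return [symbol.upper() for symbol in CASHTAG_RE.findall(text)]


def cashtag_counts(tweets):
    """Count how many tweets mention each symbol (at most once per tweet)."""
    counts = Counter()
    for text in tweets:
        counts.update(set(extract_cashtags(text)))
    return counts


# Hypothetical tweets, for illustration only.
sample_tweets = [
    "Buying more $AAPL and $msft today",
    "$AAPL to the moon! $AAPL",
    "Lunch cost me $20, no stocks here",
]
counts = cashtag_counts(sample_tweets)
print(counts)
```

Counting once per tweet is a design choice: it keeps a single tweet that repeats a symbol from inflating the frequency, which matters if frequency is later compared with price changes.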

## Questions

1.  Identify elements of the potential dataset(s) that match each of the four Vs of Big Data:

  a.  Velocity: 
  1. Stock trading information requires zettabytes of data to be processed in a day.
  2. Stock option prediction algorithms require very time-sensitive data and quick predictions on the fly.

  b.  Veracity: 
  1. Issues arising from whether cashtags are double counted in retweets or replies, and whether misspelled cashtags can be identified.
  2. Poor data quality costs the US economy trillions of dollars a year.

  c.  Volume: 
  1. Using a cloud solution to efficiently perform calculations on gigabytes or terabytes of data.
  2. Local governments have millions of paper documents that need to be scanned and processed using NLP.

  d.  Variety: 
  1. Issues arising from stock symbol changes.
  2. Healthcare providers used to keep records in SQL Server, Excel, text files and paper binders.
  

2. To find the stock price at a point in time for a cashtag (e.g., $A), we need to know which company uses that NYSE listing, and then find the listing at that time period.  A public dataset of stock listings is available at [ftp.nasdaqtrader.com/SymbolDirectory/otherlisted.txt]().  

  a. You can download files directly into Python from an FTP server using the `FTP` class from the standard-library `ftplib` module.  Read the file into Python, return the number of rows, and return the ticker symbol of the security with the longest name.

In [1]:
import psycopg2
import os
from dotenv import load_dotenv

In [None]:
from ftplib import FTP
from io import StringIO
import csv

# The following lines are for your information.  They represent a recipe for downloading data directly from an FTP server:
session = FTP('ftp.nasdaqtrader.com')
session.login()
r = StringIO()
session.retrlines('RETR /SymbolDirectory/otherlisted.txt', lambda x: r.write(x+'\n'))
session.quit()  # close the FTP connection once the file is in memory

r.seek(0)
csvfile = list(csv.DictReader(r, delimiter='|'))

# From here apply your solution.


In [4]:
# Output the number of rows
n = len(csvfile)
n

5782

In [20]:
# Pick the row with the longest security name; max() keeps the first row on ties.
longest = max(csvfile, key=lambda row: len(row["Security Name"]))
longest["ACT Symbol"]

'GJP'
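
The parsing step can be tried offline on a small sample. The sketch below uses a made-up three-column excerpt in the same pipe-delimited layout; the sample rows and the trailer line are assumptions for illustration. Symbol directory files of this kind may end with a "File Creation Time" trailer row, and if this one does, that row is counted as a listing unless it is filtered out.

```python
import csv
from io import StringIO

# Hypothetical excerpt in the same pipe-delimited layout (not real file contents).
sample = (
    "ACT Symbol|Security Name|Exchange\n"
    "A|Agilent Technologies, Inc. Common Stock|N\n"
    "AA|Alcoa Corporation Common Stock|N\n"
    "File Creation Time: 0101202000:00||\n"
)

rows = list(csv.DictReader(StringIO(sample), delimiter="|"))

# Assumption: a trailer row starts with "File Creation Time" in the first column.
listings = [
    row for row in rows
    if not row["ACT Symbol"].startswith("File Creation Time")
]

# Row with the longest security name among the real listings.
longest = max(listings, key=lambda row: len(row["Security Name"]))

print(len(rows), len(listings), longest["ACT Symbol"])
```

Comparing `len(rows)` with `len(listings)` on the real file is a quick way to check whether a trailer row is inflating the row count.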

2b.  In the database there is a table called `tickers`.  Connect to the database using Python.  Using a SQL query, return the number of rows in this table.

In [2]:
load_dotenv()
conString = {'host':os.environ.get('DB_HOST'),
             'dbname':os.environ.get('DB_NAME'),
             'user':os.environ.get('DB_USER'),
             'password':os.environ.get('DB_PASS'),
             'port':os.environ.get('DB_PORT')}
conn = psycopg2.connect(**conString)
cur = conn.cursor()
cur.execute("select count(*) from import.tickers")
nrow=cur.fetchone()
nrow

(5754,)

2c. Use a SQL query (with Python) to return the row with the longest company name.

In [22]:
#conn.close()
conn = psycopg2.connect(**conString)
cur = conn.cursor()
cur.execute("""SELECT *
FROM import.tickers
ORDER BY length(securityname) DESC
LIMIT 1""")
cur.fetchone()

('GJP',
 'Synthetic Fixed-Income Securities, Inc. Synthetic Fixed-Income Securities, Inc. on behalf of STRATS (SM) Trust for Dominion Resources, Inc. Securities, Series 2005-6, Floating Rate Structured Repackaged Asset-Backed Trust Securities (STRATS) Certificates',
 'N',
 'GJP',
 'N',
 '100',
 'N',
 'GJP')
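
The two queries above can be rehearsed without cloud credentials. The sketch below uses an in-memory SQLite database as a stand-in for Postgres, so it is not the course database; the table contents and the `actsymbol` column name are hypothetical. It also adds a secondary sort key, which makes the result deterministic if two names have the same length.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres database (assumption for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tickers (actsymbol TEXT, securityname TEXT)")
cur.executemany(
    "INSERT INTO tickers VALUES (?, ?)",
    [
        ("A", "Agilent Technologies"),
        ("AA", "Alcoa"),
        ("ZZ", "Zeta Holdings Company"),  # made-up row
    ],
)

# fetchone() returns a one-element tuple; unpack it to get the bare count.
cur.execute("SELECT count(*) FROM tickers")
(n_rows,) = cur.fetchone()

# Longest name first; the symbol breaks ties so the answer is reproducible.
cur.execute(
    """SELECT actsymbol, securityname
       FROM tickers
       ORDER BY length(securityname) DESC, actsymbol ASC
       LIMIT 1"""
)
longest_row = cur.fetchone()
conn.close()

print(n_rows, longest_row)
```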


3. An individual tweet, as returned from the Twitter API, may be a complex object.  This data is stored as JSON objects within a Postgres database in the cloud, in a table called `tweets` in a schema called `import`.  Connect to the database.  How many individual tweets are in our dataset?

In [None]:
# Count distinct tweet ids so that duplicated rows are not counted twice.
cur.execute("SELECT count(DISTINCT id) FROM import.tweets")
nrow = cur.fetchone()
cur.close()
conn.close()
nrow  # last expression in the cell, so the notebook displays it
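
The difference between counting rows and counting distinct ids can be seen on a few JSON records. This is a minimal sketch with made-up records; the field names (`id`, `text`, `user`) are assumptions about the tweet objects, not a description of the actual table.

```python
import json

# Hypothetical JSON-encoded tweets; the first and last share the same id.
raw_records = [
    '{"id": "101", "text": "$AAPL up", "user": {"id": "1"}}',
    '{"id": "102", "text": "$MSFT down", "user": {"id": "2"}}',
    '{"id": "101", "text": "$AAPL up", "user": {"id": "1"}}',
]

tweets = [json.loads(record) for record in raw_records]

total_rows = len(tweets)                       # analogous to count(*)
distinct_ids = len({t["id"] for t in tweets})  # analogous to count(DISTINCT id)

print(total_rows, distinct_ids)
```

If the two numbers differ on the real table, the same tweet was stored more than once, which bears on the veracity question about double counting.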

4. Currently all columns in the table `import.tweets` are coded as Postgres `text` columns.  How would you normalize these tables?  Use [`CREATE TABLE IF NOT EXISTS`]() to generate the appropriate tables in the `import` schema.  What do the normalized tables look like? You should add PRIMARY and FOREIGN key constraints, but do not need to add indexes or other constraints.

In [None]:
# Reconnect: the connection was closed at the end of question 3.
conn = psycopg2.connect(**conString)
cur = conn.cursor()

# Create a users table holding one row per user id.
cur.execute("""CREATE TABLE IF NOT EXISTS import.users (
userid numeric PRIMARY KEY);""")

# Create a tweets table containing tweet id, content, retweet and reply fields.
# Reply and retweet status ids are self-referencing foreign keys to tweets(id);
# every user id column references the users table.
# Note: the raw table is already named import.tweets, so IF NOT EXISTS skips
# this statement unless the raw table is renamed first.
cur.execute("""CREATE TABLE IF NOT EXISTS import.tweets (
id numeric PRIMARY KEY,
content text,
inreplytostatusid numeric REFERENCES import.tweets(id),
inreplytouserid numeric REFERENCES import.users(userid),
retweetedstatusid numeric REFERENCES import.tweets(id),
retweeteduserid numeric REFERENCES import.users(userid),
lang text,
source text,
createdat timestamp,
userid numeric REFERENCES import.users(userid));""")
conn.commit()
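
To see what a normalized layout looks like once it holds data, the sketch below loads a few flat, all-text rows into a `users` table and a `tweets` table. It is a simplified illustration under stated assumptions: it uses in-memory SQLite rather than Postgres, keeps only four of the columns, and the rows in `flat_rows` are made up.

```python
import sqlite3

# Hypothetical flat rows as (id, content, userid, inreplytostatusid), all text.
flat_rows = [
    ("101", "$AAPL up", "1", None),
    ("102", "agreed on $AAPL", "2", "101"),
    ("103", "$MSFT down", "1", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY)")
conn.execute(
    """CREATE TABLE tweets (
           id INTEGER PRIMARY KEY,
           content TEXT,
           inreplytostatusid INTEGER REFERENCES tweets(id),
           userid INTEGER NOT NULL REFERENCES users(userid)
       )"""
)

# Parents first: one users row per distinct user id.
user_ids = sorted({int(row[2]) for row in flat_rows})
conn.executemany("INSERT INTO users VALUES (?)", [(u,) for u in user_ids])

# Then tweets, cast from text. Sorting by id inserts replied-to tweets first,
# which assumes a reply always has a larger id than the tweet it answers.
for tid, content, uid, reply_to in sorted(flat_rows, key=lambda r: int(r[0])):
    conn.execute(
        "INSERT INTO tweets VALUES (?, ?, ?, ?)",
        (int(tid), content, int(reply_to) if reply_to else None, int(uid)),
    )
conn.commit()

n_users = conn.execute("SELECT count(*) FROM users").fetchone()[0]
n_tweets = conn.execute("SELECT count(*) FROM tweets").fetchone()[0]
print(n_users, n_tweets)
```

One thing to weigh with self-referencing keys: a reply or retweet may point at a tweet that was never collected, in which case the foreign key would reject the row. Whether to keep that constraint or store the id without a reference is a design decision for the real data.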

