### Coursera Learn SQL Basics for Data Science Specialization ###
### Capstone Milestone 1, project proposal and data selection / preparation.  ###
### Steve Schluchter ###



# Step 1: Preparing for your proposal

# You will document your preparation in developing the project proposal.  This includes:

## 1.) Which client/dataset did you select and why?

I chose the Lobbyists4America client (the Congressional tweets dataset). I did this for the opportunity to analyze some Twitter data and to see if there was a chance to build a robust social network for the purpose of doing social network analysis. I also wanted to explore what it means to be 'influential' in this setting.

## 2.) Describe the steps you took to import and clean the data.

I chose to use PySpark as my first portal to the data. I'm using PySpark 3.5.1.
I also ported the data to pandas DataFrames to make it easier to apply regular expressions to the text and extract hashtags, handles, etc.
In my experience, pandas also makes it easier to compute summary statistics and to do basic analysis.

### The following relates to the tweet data and not the user data.

I started cleaning the data by determining which fields might actually be useful. A few of the fields contained no more than one distinct value (NULL in some cases). Some fields were unusable because the URLs they contained appeared to be dead ends. Some fields duplicated others: a few columns were recasts of other columns as strings. Geolocation and some other kinds of data (language, withholding, etc.), while interesting, did not seem applicable as indicators of popularity.
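The check for fields with at most one distinct value can be sketched in plain Python. This is only an illustration, assuming each record is available as a dict; the helper name `constant_fields` and the sample rows are hypothetical and not taken from the dataset.

```python
def constant_fields(records):
    """Return the sorted names of fields that hold at most one distinct value."""
    seen = {}
    for record in records:
        for key, value in record.items():
            # repr() makes unhashable values (lists, dicts) comparable as well.
            seen.setdefault(key, set()).add(repr(value))
    return sorted(key for key, values in seen.items() if len(values) <= 1)


# Hypothetical sample rows, not real tweets.
sample_rows = [
    {"id": 1, "contributors": None, "lang": "en", "favorited": False},
    {"id": 2, "contributors": None, "lang": "es", "favorited": False},
    {"id": 3, "contributors": None, "lang": "en", "favorited": False},
]
droppable = constant_fields(sample_rows)
print(droppable)
```

Fields reported this way carry no information that distinguishes one row from another, which is the reason for dropping them.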

I think the in_reply_to_user_id field is very useful as a measure of who is sparking conversations.
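One way to turn that idea into a number is to count how many tweets were written in reply to each user id. The sketch below is a plain-Python illustration under the assumption that a missing `in_reply_to_user_id` means the tweet is not a reply; the function name `replies_received` and the sample rows are hypothetical.

```python
from collections import Counter


def replies_received(tweets):
    """Count how many tweets in the collection reply to each user id."""
    counts = Counter()
    for tweet in tweets:
        target = tweet.get("in_reply_to_user_id")
        if target is not None:  # None/NULL: the tweet is not a reply
            counts[target] += 1
    return counts


# Hypothetical sample rows, not real tweets.
sample_tweets = [
    {"id": 1, "in_reply_to_user_id": 10},
    {"id": 2, "in_reply_to_user_id": None},
    {"id": 3, "in_reply_to_user_id": 10},
    {"id": 4, "in_reply_to_user_id": 20},
]
reply_counts = replies_received(sample_tweets)
print(reply_counts.most_common(2))
```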

I also chose to extract hashtags and twitter handles from the tweets themselves, and to store them with the tweet data.
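A minimal sketch of that extraction, using the same `#\w+` and `@\w+` patterns that appear in the code further down. The helper name `extract_tags` and the sample text are made up for illustration. One known limitation of these patterns: they also match the part after `@` in an email address.

```python
import re

HASHTAG_PATTERN = re.compile(r"#\w+")
HANDLE_PATTERN = re.compile(r"@\w+")


def extract_tags(text):
    """Return (hashtags, handles) found in a tweet's text, in order of appearance."""
    return HASHTAG_PATTERN.findall(text), HANDLE_PATTERN.findall(text)


# Made-up example text, not a real tweet.
hashtags, handles = extract_tags("Thanks @SenExample for backing #Infrastructure and #jobs!")
print(hashtags)
print(handles)
```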

I chose to keep most of what I thought could identify user accounts: screen names, ids, etc.

There is also the case of the 'source' field, which varies from row to row but still seems useless. The rows in this column contain source code for websites, with nothing usable in them as far as I can tell.

I chose to keep the following fields in the tweets data: entities, extended_entities, favorite_count, favorited, geo, id, in_reply_to_screen_name, in_reply_to_user_id, is_quote_status, retweet_count, retweeted, screen_name, text, withheld_in_countries, withheld_scope.

I made a point of deserializing the object data where I thought it was appropriate and storing it as separate dataframes. The same goes for the user data.
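As a rough illustration of what deserializing into a separate table can look like, the sketch below turns a nested `entities` object into one row per (tweet, hashtag). It assumes `entities` is either a dict or a JSON string shaped like `{"hashtags": [{"text": ...}]}`; that shape, the helper name `explode_hashtags`, and the sample rows are assumptions for illustration, not something verified against this dataset.

```python
import json


def explode_hashtags(tweets):
    """Build one row per (tweet id, hashtag) from each tweet's nested entities."""
    rows = []
    for tweet in tweets:
        entities = tweet.get("entities") or {}
        if isinstance(entities, str):
            # Assumed: serialized entities are stored as JSON text.
            entities = json.loads(entities)
        for tag in entities.get("hashtags", []):
            rows.append({"tweet_id": tweet["id"], "hashtag": tag["text"]})
    return rows


# Hypothetical sample rows, not real tweets.
sample_tweets = [
    {"id": 1, "entities": '{"hashtags": [{"text": "jobs"}, {"text": "budget"}]}'},
    {"id": 2, "entities": {"hashtags": []}},
    {"id": 3, "entities": None},
]
hashtag_rows = explode_hashtags(sample_tweets)
print(hashtag_rows)
```

The resulting rows can be loaded into their own dataframe and joined back to the tweets on `tweet_id`.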

### The following applies to the user data and not the tweet data.

I chose to keep the following fields for further analysis, for reasons similar to those behind my decisions about the tweet data: description, entities, favourites_count, followers_count, geo_enabled, is_translator, location, name, screen_name, verified.

## 3.) Perform initial exploration of data and provide some screenshots or display some stats of the data you are looking at.

See below.

## 4.) Create an ERD or proposed ERD to show the relationships of the data you are exploring.

I've included the png files here.

In [None]:
from IPython import display

display.Image('./tweet.png')


In [None]:
display.Image('./user.png')

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Steve's congressional tweet session")\
                            .config("spark.driver.memory", "5g")\
                            .config("spark.driver.cores", '4')\
                            .getOrCreate()

In [None]:
tweets_df = spark.read.json('../capstone_data/tweets.json')
users_df = spark.read.json('../capstone_data/users.json')

In [None]:
tweets_df.show(10)  

In [None]:
#display('contributors')
tweets_df.select("contributors").distinct().show(truncate=False)
#display('coordinates')
tweets_df.select('coordinates').distinct().show(truncate=False)
#display('entities')
tweets_df.select("entities").distinct().show(truncate=False)
#display('extended_entities')
tweets_df.select(['extended_entities']).distinct().show(truncate=False)
#display('favorite_count')
tweets_df.select(['favorite_count']).distinct().show(truncate=False)
#display('favorited')
tweets_df.select(['favorited']).distinct().show(truncate=False)
#display('in_reply_to_screen_name')
tweets_df.select(['in_reply_to_screen_name']).distinct().show(truncate=False)
#display('in_reply_to_status_id')
tweets_df.select('in_reply_to_status_id').distinct().show(truncate=False)
#display('in_reply_to_user_id')
tweets_df.select('in_reply_to_user_id').distinct().show(truncate=False)
#display('is_quote_status')
tweets_df.select('is_quote_status').distinct().show(truncate=False)
#display('lang')
tweets_df.select('lang').distinct().show(truncate=False)
#display('place')
tweets_df.select('place').distinct().show(truncate=False)
#display('retweet_count')
tweets_df.select('retweet_count').distinct().show()
#display('retweeted')
tweets_df.select('retweeted').distinct().show(truncate=False)
#display('screenname')
tweets_df.select('screen_name').distinct().show(truncate=False)
#display('source')
tweets_df.select('source').distinct().show(truncate=False)
#display('withheld_copyright')
tweets_df.select('withheld_copyright').distinct().show(truncate=False)
#display('withheld_in_countries')
tweets_df.select('withheld_in_countries').distinct().show(truncate=False)
#display('withheld_scope')
tweets_df.select('withheld_scope').distinct().show(truncate=False)
#display('text')
tweets_df.select('text').distinct().show(truncate=False)

In [None]:
from pyspark.sql.functions import from_unixtime

print(tweets_df.select('screen_name').distinct().count())
print(tweets_df.count())
print('duh!')
# show() prints the table itself and returns None, so it is not wrapped in print()
tweets_df.select(from_unixtime('created_at').alias('datetime')).agg({'datetime':'min'}).show()
tweets_df.select(from_unixtime('created_at').alias('datetime')).agg({'datetime':'max'}).show()

print('finished')



In [8]:
tweets_df = tweets_df.select(['favorite_count', 'id', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_user_id', \
'quoted_status_id','retweet_count','screen_name','text', 'user_id','created_at'])

In [None]:
import pandas as pd

tweets_pd_df = tweets_df.toPandas()

In [10]:
import re

hashtags = []
handles = []

for tweet in tweets_pd_df['text']:
    hashtags.append([x.group() for x in re.finditer( r'#\w+', tweet)])
    handles.append([x.group() for x in re.finditer( r'@\w+', tweet)])

tweets_pd_df['hashtags'] = hashtags
tweets_pd_df['handles'] = handles
    

In [None]:
tweets_pd_df.info()
tweets_pd_df.head(10)

In [None]:


"""
display('contributors enables')
users_df.select('contributors_enabled').distinct().show(truncate=False)
display('default profile')
users_df.select('default_profile').distinct().show(truncate=False)
display('description')
users_df.select('description').distinct().show(truncate=False)
display('entities')
users_df.select('entities').distinct().show(truncate=False)
display('follow_request_sent')
users_df.select('follow_request_sent').distinct().show(truncate=False)
display('following')
users_df.select('following').distinct().show(truncate=False)
display('is translation enabled')
users_df.select('is_translation_enabled').distinct().show(truncate=False)
display('is translator')
users_df.select('is_translator').distinct().show(truncate=False)
display('lang')
users_df.select('lang').distinct().show(truncate=False)
display('url')
users_df.select('url').show(10, truncate=False)
display('verified')
users_df.select('verified').distinct().show(10, truncate=False)

users_df.show(10)
"""

users_df = users_df.select(['description', 'following', 'followers_count', 'friends_count', 'id','listed_count','screen_name','statuses_count', 'verified'])
users_pd_df = users_df.toPandas()
users_pd_df.info()
users_df.show(10)


In [None]:
import inspect
import rich.syntax

import erdantic.examples.attrs


rich.syntax.Syntax(
    inspect.getsource(erdantic.examples.attrs), 
    "python",
    theme="default",
    line_numbers=True
)

In [14]:
                                                                                                               
from dataclasses import dataclass, field                                                                      
from datetime import datetime                                                                                 
from enum import Enum                                                                                         
from typing import List, Optional                                                                             
                                                                                                               
                                                                                                                  
class Alignment(str, Enum):                                                                                   
    LAWFUL_GOOD = "lawful_good"                                                                               
    NEUTRAL_GOOD = "neutral_good"                                                                             
    CHAOTIC_GOOD = "chaotic_good"                                                                             
    LAWFUL_NEUTRAL = "lawful_neutral"                                                                         
    TRUE_NEUTRAL = "true_neutral"                                                                             
    CHAOTIC_NEUTRAL = "chaotic_neutral"
    LAWFUL_EVIL = "lawful_evil"                                                                               
    NEUTRAL_EVIL = "neutral_evil"                                                                        
    CHAOTIC_EVIL = "chaotic_evil"                                                                             
                                                                                                              
                                                                                                               
@dataclass                                                                                                    
class Adventurer:                                                                                             
    """A person often late for dinner but with a tale or two to tell.                                         
                                                                                                               
    Attributes:                                                                                               
    name (str): Name of this adventurer                                                                   
    profession (str): Profession of this adventurer                                                       
    alignment (Alignment): Alignment of this adventurer                                                   
    level (int): Level of this adventurer                                                                     
    """                                                                                                       
                                                                                                                
    name: str                                                                                                 
    profession: str                                                                                           
    alignment: Alignment                                                                                      
    level: int = 1                                                                                            
                                                                                                                 
                                                                                                                 
@dataclass                                                                                                    
class QuestGiver:                                                                                             
    """A person who offers a task that needs completing.

    Attributes:
    name (str): Name of this quest giver
    faction (str): Faction that this quest giver belongs to
    location (str): Location this quest giver can be found
    """

    name: str                                                                                                 
    faction: Optional[str] = None                                                                             
    location: str = "Adventurer's Guild"                                                                      
                                                                                                              
                                                                                                              
@dataclass                                                                                                    
class Quest:                                                                                                  
    """A task to complete, with some monetary reward.                                                         
                                                                                                               
    Attributes:                                                                                               
    name (str): Name by which this quest is referred to                                                   
    giver (QuestGiver): Person who offered the quest                                                      
    reward_gold (int): Amount of gold to be rewarded for quest completion                                 
    """                                                                                                       
                                                                                                        
    name: str                                                                                                 
    giver: QuestGiver                                                                                         
    reward_gold: int = 100                                                                                    
                                                                                                                 
                                                                                                                 
@dataclass                                                                                                    
class Party:                                                                                                  
    """A group of adventurers finding themselves doing and saying things altogether unexpected.

    Attributes:
    name (str): Name that party is known by
    formed_datetime (datetime): Timestamp of when the party was formed
    members (List[Adventurer]): Adventurers that belong to this party
    active_quest (Optional[Quest]): Current quest that party is actively tackling
    """

    name: str
    formed_datetime: datetime
    members: List[Adventurer] = field(default_factory=list)
    active_quest: Optional[Quest] = None

"""
0   description      548 non-null    object
 1   following        548 non-null    bool  
 2   followers_count  548 non-null    int64 
 3   friends_count    548 non-null    int64 
 4   id               548 non-null    int64 
 5   listed_count     548 non-null    int64 
 6   screen_name      548 non-null    object
 7   statuses_count   548 non-null    int64 
 8   verified         548 non-null    bool 
"""

"""
 0   favorite_count           1243370 non-null  int64  
 1   id                       1243370 non-null  int64  
 2   in_reply_to_screen_name  65411 non-null    object 
 3   in_reply_to_status_id    54146 non-null    float64
 4   in_reply_to_user_id      65411 non-null    float64
 5   quoted_status_id         56418 non-null    float64
 6   retweet_count            1243370 non-null  int64  
 7   screen_name              1243370 non-null  object 
 8   text                     1243370 non-null  object 
 9   truncated                1243370 non-null  bool   
 10  user_id                  1243370 non-null  int64  
 11  hashtags                 1243370 non-null  object 
 12  handles                  1243370 non-null  object
"""


@dataclass
class User:
    """
    a twitter user
    """
    description: str
    following: bool
    followers_count: int
    friends_count:int
    id: int
    listed_count: int
    screen_name:str
    statuses_count: int
    verified: bool
    
    

@dataclass
class Tweet:
    """
    a tweet
    """
    favorite_count: int
    id: int
    in_reply_to_screen_name: Optional[str]
    in_reply_to_user_id: Optional[int]
    quoted_status_id: Optional[int]
    retweet_count: int
    screen_name: str
    text: str
    user_id: int
    hashtags: List[str]
    handles:  List[str]
    

In [None]:

import pandas as pd
from pyspark.sql import Row

spark_df = spark.createDataFrame([
    Row(a=1),
    Row(a=2),
    Row(a=3),
    Row(a=4),
    Row(a=5),
    Row(a=6),
    Row(a=7),
    Row(a=8),
    Row(a=9),
    Row(a=10)])

spark_df.show(10)

spark_df = spark.createDataFrame(pd.DataFrame({'a':[1,2,3,4,5,6,7,8,9,10]}))

spark_df.show(10)



In [16]:
from pyspark.sql import functions as F


joined_spark_df = tweets_df.select(['favorite_count','created_at','id','in_reply_to_screen_name','in_reply_to_status_id','in_reply_to_user_id','quoted_status_id','retweet_count','text','user_id']).join(users_df, tweets_df.user_id == users_df.id, 'inner')

In [None]:
joined_spark_df.show(10)

In [None]:

from pyspark.sql.functions import desc, count, countDistinct

# Check that each user id has only one screen name in the user data:
# the largest number of distinct screen names per id should be 1.
users_df.groupBy('id').agg(countDistinct('screen_name').alias('n_screen_names')).agg({'n_screen_names': 'max'}).show(10)

#users_df.select(['id', 'screen_name','statuses_count']).orderBy(desc('statuses_count')).show(25)
#users_df.select(['id', 'screen_name'])

In [None]:
# Check that there is only one screen name per user_id in the tweet data:
# the largest number of distinct screen names per user_id should be 1.
tweets_df.groupBy('user_id').agg(countDistinct('screen_name').alias('n_screen_names')).agg({'n_screen_names': 'max'}).show(10)


In [None]:
# Alias the aggregate so the sort can refer to it by a stable column name
tweets_df.select(['id','user_id']).groupBy(['user_id']).agg(countDistinct('id').alias('tweet_count')).sort('tweet_count',ascending=False).show(10)

In [21]:
import pyspark.sql.functions as F
import pyspark.sql.types as T



tweets_df_year = tweets_df.withColumn('date', F.to_timestamp(tweets_df.created_at.cast(dataType=T.TimestampType()))).withColumn('year',F.year('date'))


In [None]:
tweets_df_by_year = tweets_df_year.select(['id','user_id','screen_name','year']).groupby(['user_id','year']).agg(count('id'))
tweets_df_by_year.sort(['user_id','year']).show(100)

In [None]:
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

tweets_df_totals= tweets_df_year.select(['screen_name','id']).groupby(['screen_name']).agg(count('id'))
tweets_df_totals.sort('screen_name').show(10)
tweets_df_totals = tweets_df_totals.sort('screen_name')
print(tweets_df_totals.count())
retweets_df_totals = tweets_df_year.select(['screen_name','retweet_count']).groupby(['screen_name']).sum('retweet_count')
retweets_df_totals.sort('screen_name').show(10)

print(retweets_df_totals.count())
tweets_retweets_join = tweets_df_totals.join(retweets_df_totals,'screen_name').sort('screen_name').select(['count(id)','sum(retweet_count)'])
tweets_retweets_join.show(10)
print(tweets_retweets_join.stat.corr('count(id)','sum(retweet_count)'))


#pearsonCorr = Correlation.corr(tweets_retweets_join,'features','pearson').collect()[0][0]

In [None]:

from pyspark.sql.types import IntegerType

retweet_count_spark = tweets_df_year.select(['screen_name','retweet_count','year'])#.withColumn('retweets',F.col('retweet_count').cast(T.IntegerType()).alias('retweets'))#.groupby(['user_id']).agg(sum('retweets'))
#retweet_count_spark = retweet_count_spark.withColumn('retweets', retweet_count_spark["retweet_count"].cast(IntegerType()).alias('retweets'))#.groupby(['user_id']).agg(sum('retweets'))
#retweet_count_spark_by_year = retweet_count_spark.()
retweet_count_spark.show(10)
#retweet_count_spark.sort('retweets', ascending=False)
#retweet_count_spark.describe() 
pd_retweet_count_spark = retweet_count_spark.toPandas()
pd_retweet_count_spark.info()

In [None]:
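# Select the column explicitly; the first positional argument of GroupBy.sum() is numeric_only, not a column name
pd_retweet_count_spark.groupby(['screen_name','year'])['retweet_count'].sum().head(20)

pd_retweet_count_year = pd_retweet_count_spark.groupby(['year','screen_name']).agg({'retweet_count':'sum'})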

df = pd_retweet_count_year.sort_values(['year','retweet_count'],ascending=False)

df = df.reset_index()

#df = pd_retweet_count_year.to_frame()#pd_retweet_count_year.head(20)

print(df.head(100))

print(df['year'].nunique())

arr_retweets = [df[df['year']== x] for x in range(2008,2018)]  

print(arr_retweets)

In [None]:
# Total retweets per screen name, largest first
print(pd_retweet_count_spark.groupby('screen_name')[['retweet_count']].sum().sort_values('retweet_count', ascending=False).head(20))

pandas_retweet_count_spark = pd_retweet_count_spark.groupby('screen_name')[['retweet_count']].sum().reset_index()

tweet_count = tweets_df_year.select(['screen_name','id','retweet_count']).toPandas()
#print(tweet_count.head(10))

print(pandas_retweet_count_spark['retweet_count'].head(10))



In [None]:
#print(tweet_count.info())
#print(tweet_count.head(10))

#print(tweet_count.groupby(['screen_name']).id.nunique())

pandas_tweet_count = tweet_count.groupby(['screen_name']).agg({'id':'count'})
print(pandas_tweet_count.head(10))

print(pandas_tweet_count['id'].reset_index()['id'].corr(pandas_retweet_count_spark['retweet_count']))

print(pd_retweet_count_spark.head(10))



In [None]:
arr_retweets[0].head(10)
#arr_retweets[9].head(10)

In [None]:
#users_df.show(10)
import seaborn as sns

pandas_user_metadata_df = users_df.select(['id','screen_name','followers_count','friends_count','listed_count','statuses_count']).toPandas()

#pandas_tweet_count = pandas_tweet_count.toPandas()
heat_map_data  = pandas_user_metadata_df.join(pandas_tweet_count,how='inner',on='screen_name',lsuffix='_left',rsuffix='_right')

heat_map_data = heat_map_data[['followers_count', 'friends_count','listed_count', 'statuses_count', 'id_right']]

#heat_map_data.head(10)

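# id_right is the number of tweets per screen name (from pandas_tweet_count), so label it as a tweet count
heat_map_matrix = heat_map_data.rename(columns = {'id_right':'tweet_count'}).corr()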

print(heat_map_matrix)

sns.heatmap(heat_map_matrix)



#user_metadata_df.show(10)

In [None]:
tweets_pd_df.head(10)

from collections import defaultdict

all_hashtags_count = defaultdict(int)
all_handles_count = defaultdict(int)



all_hashtags = []
all_handles = []


for index, row in tweets_pd_df.iterrows():
    #print(x)
    #print("XXXXXX")
    hashtags = row['hashtags']
    handles = row['handles']
    for hashtag in hashtags:
        all_hashtags_count[hashtag] += 1
    for handle in handles:
        all_handles_count[handle[1:].strip()] += 1

        
    #print("YYYYY")


"""
for x in tweets_pd_df['hashtags']:
    if x != []:
        for hashtag in x:
            #print(hashtag)
            print(all_hashtags_count['hashtag'])
            all_hashtags_count[hashtag] = all_hashtags_count[hashtag] + 1



print(all_handles)
print(all_hashtags)
"""

In [None]:
print(len(all_handles_count))

for key, value in all_handles_count.items():
    print(key, ' ', value)
    

In [None]:
print(len(all_hashtags_count))

for key, value  in all_hashtags_count.items():
    print(key, ' ', value)



In [None]:
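# Lowercased screen names of the users in the dataset, kept as a set for fast membership tests
screen_names = set(x.lower() for x in users_df.select(['screen_name']).toPandas()['screen_name'].values)
print(screen_names)

screen_name_counts = defaultdict(int)

for k, v in all_handles_count.items():
    if k.lower() in screen_names:
        # Add this handle's own mention count (v), not a value left over from an earlier loop
        screen_name_counts[k.lower()] += v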


print(dict(screen_name_counts.items()))

df_screen_names_counts = pd.DataFrame(screen_name_counts.items(), columns=['screen_name','count'])


print(df_screen_names_counts.head(10))

df_screen_names_counts.sort_values(by=['count'], ascending=False, inplace=True)

df_screen_names_counts.head(15)

In [34]:
import networkx as nx
twitter_network = nx.Graph()


In [None]:
from textblob import TextBlob

# Boolean flag: True when the tweet was retweeted at least once
tweets_pd_df['AB_retweeted'] = tweets_pd_df['retweet_count'] > 0

tweets_pd_df['sentiment'] = tweets_pd_df.text.apply(lambda text: TextBlob(text).sentiment[0])

tweets_pd_df.info()

tweets_pd_df.head(10)

In [None]:
tweets_pd_df['absolute_sentiement'] = abs(tweets_pd_df['sentiment'])
tweets_pd_df.head(10)
for i in range(0,10):
    print(i)
    tweets_pd_df[f"test_sentiment_{i}"] = (tweets_pd_df['absolute_sentiement'] > (0.1 * i)).astype(int)

tweets_pd_df['retweeted_int'] = (tweets_pd_df['retweet_count']).astype(int)
tweets_pd_df.head(10)

In [None]:
#tweets_pd_df.info()
#tweets_pd_df.head(10)
#tweets_pd_df['absolute_sentiement']

tweets_pd_df['has_hashtags'] = tweets_pd_df.hashtags.apply(lambda hashtags: len(hashtags) > 0)

print(tweets_pd_df['has_hashtags'])

#num_handles_and_retweets = tweets_pd_df[]

tweets_pd_df['has_hashtags'] = tweets_pd_df['has_hashtags'].astype('int')

print(tweets_pd_df['has_hashtags'])

To do an A/B test for each i above, here's what we need to do.
1.) We need to fill out the box of 4 squares.

  a.) Retweeted and over threshold.

  b.) Retweeted and not over threshold.

  c.) Not retweeted and over threshold.

  d.) Not retweeted and not over threshold.

Take all 4 numbers for each i and plug them into the online calculator https://thumbtack.github.io/abba/demo/abba.html .
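For a quick offline sanity check, the four numbers can also be fed into a standard two-proportion z-test. The sketch below is a textbook version of that test and is not necessarily the method the online calculator uses; the function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(retweeted_over, not_retweeted_over,
                          retweeted_under, not_retweeted_under):
    """Compare retweet rates for tweets over vs. not over a sentiment threshold.

    Returns (rate_over, rate_under, z, two_sided_p_value).
    """
    n_over = retweeted_over + not_retweeted_over
    n_under = retweeted_under + not_retweeted_under
    rate_over = retweeted_over / n_over
    rate_under = retweeted_under / n_under
    # Pooled retweet rate under the null hypothesis of no difference.
    pooled = (retweeted_over + retweeted_under) / (n_over + n_under)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_over + 1 / n_under))
    z = (rate_over - rate_under) / std_err
    # Two-sided p-value for a standard normal statistic.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_over, rate_under, z, p_value


# Hypothetical counts for squares a, c, b, d (in that order), not results from the dataset.
rate_over, rate_under, z, p_value = two_proportion_z_test(60, 40, 50, 50)
print(rate_over, rate_under, round(z, 3), round(p_value, 3))
```

A small p-value suggests the retweet rates differ between the two groups, but it says nothing about why they differ.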


In [None]:
for i in range(1,10):
    number_sentimental = tweets_pd_df[tweets_pd_df[f"test_sentiment_{i}"] == 1].shape[0]
    number_not_sentimental = tweets_pd_df[tweets_pd_df[f"test_sentiment_{i}"] == 0].shape[0]
    retweeted_sentimental = tweets_pd_df[(tweets_pd_df[f"test_sentiment_{i}"] == 1) & (tweets_pd_df['AB_retweeted'] == True)].shape[0]
    retweeted_not_sentimental = tweets_pd_df[(tweets_pd_df[f"test_sentiment_{i}"] == 0) & (tweets_pd_df['AB_retweeted'] == True)].shape[0]
    not_retweeted_sentimental = tweets_pd_df[(tweets_pd_df[f"test_sentiment_{i}"] == 1) & (tweets_pd_df['AB_retweeted'] == False)].shape[0]
    not_retweeted_not_sentimental = tweets_pd_df[(tweets_pd_df[f"test_sentiment_{i}"] == 0) & (tweets_pd_df['AB_retweeted'] == False)].shape[0]
    print(f"Experiment {i}")
    print(f"Sentiment over {round(0.1 * i, 2)}: {number_sentimental}.")
    print(f"Sentiment not over {round(0.1 * i, 2)}: {number_not_sentimental}.")
    print(f"Total experiments {number_sentimental + number_not_sentimental}.")
    print(f"Retweeted and sentiment over {round(i * 0.1,2)}: {retweeted_sentimental}.")
    print(f"Retweeted and sentiment not over {round(i * 0.1,2)}: {retweeted_not_sentimental}.")
    print(f"Not retweeted and sentiment over {round(i * 0.1,2)}: {not_retweeted_sentimental}.")
    print(f"Not retweeted and sentiment not over {round(i * 0.1,2)}: {not_retweeted_not_sentimental}.")



In [None]:
# Each comparison is parenthesized: & binds more tightly than ==, so without the parentheses the masks are wrong
retweeted_has_hashtags = tweets_pd_df[(tweets_pd_df['AB_retweeted'] == True) & (tweets_pd_df['has_hashtags'] == 1)].shape[0]
print(f" Retweeted has hashtags {retweeted_has_hashtags}.") 
retweeted_no_hashtags = tweets_pd_df[(tweets_pd_df['AB_retweeted'] == True) & (tweets_pd_df['has_hashtags'] == 0)].shape[0]
print(f" Retweeted no hashtags {retweeted_no_hashtags}.") 
not_retweeted_has_hashtags = tweets_pd_df[(tweets_pd_df['AB_retweeted'] == False) & (tweets_pd_df['has_hashtags'] == 1)].shape[0]
print(f" Not retweeted has hashtags {not_retweeted_has_hashtags}.") 
not_retweeted_no_hashtags = tweets_pd_df[(tweets_pd_df['AB_retweeted'] == False) & (tweets_pd_df['has_hashtags'] == 0)].shape[0]
print(f" Not retweeted has no hashtags {not_retweeted_no_hashtags}.")
print(f" Total retweeted {retweeted_has_hashtags + retweeted_no_hashtags}.")
print(f" Total not retweeted {not_retweeted_has_hashtags + not_retweeted_no_hashtags}")
print(f" Total with hashtags {retweeted_has_hashtags + not_retweeted_has_hashtags}")
print(f" Total no hashtags {retweeted_no_hashtags + not_retweeted_no_hashtags}.")


In [None]:
"""Adventurer
for i in range(tweets_pd_df.shape[0]):
    #print(i)
    #print(tweets_pd_df.iloc[i,12])
    #print(tweets_pd_df.iloc[i,2])
    if tweets_pd_df.iloc[i,2] == None:
        continue

    tweets_pd_df.iloc[i,12].append(tweets_pd_df.iloc[i,2])
"""



In [None]:
# Print the first reply found: column 2 is in_reply_to_screen_name, column 12 is handles
for i in range(tweets_pd_df.shape[0]):
    if tweets_pd_df.iloc[i,2] is not None:
        print(tweets_pd_df.iloc[i,2])
        print(tweets_pd_df.iloc[i,12])
        break


In [42]:
import networkx as nx

twitter_network = nx.Graph()
directed_twitter_network = nx.DiGraph()

for index, row in tweets_pd_df.iterrows():
    for handle in row['handles']:
        twitter_network.add_edge(row['screen_name'],handle[1:])
        directed_twitter_network.add_edge(row['screen_name'],handle[1:])




In [None]:
for e in twitter_network.edges():
    print(e)

In [None]:
print(1)
deg_centrality = nx.degree_centrality(twitter_network)
print('centrality')
print(type(directed_twitter_network.in_degree(directed_twitter_network.nodes())))
#for deg in directed_twitter_network.in_degree(directed_twitter_network.nodes()):
#    print(deg)

for deg in sorted(directed_twitter_network.in_degree(directed_twitter_network.nodes()), key=lambda x: x[1], reverse=True)[:20]:
    print(deg)

print(2)
#bet_centrality = nx.betweenness_centrality(twitter_network)
#print(3)
#load_cent = nx.load_centrality(twitter_network)
#print(4)
#eig_cent = nx.eigenvector_centrality(twitter_network)
#print(5)
