# Project 3: Milestone 2

Shreya Parjan

**Note:** Submission revised 12/1. Initially completed the week of 11/17.

Using the Twitter API tools, I’ll be collecting: User profiles (the tweets of each user). I will then extract hashtags from these tweets, in  order to identify common hashtags. We can then start following such hashtags, if they seem relevant to our project on activism in Chile. I’ll be analyzing some of the features of  each account: number of  followers,  number of tweets, number of following, etc. in order to compare which subjects are more popular, or more engaged with, etc. 


**Table of Contents**
1. [Part 1: Create table of hashtags across all accounts](#sec0)
2. [Part 2: Find top 10 hashtags across all accounts](#sec1)
3. [Part 3: Find top 10 hashtags for a specific account](#sec2)

## Part 1: Create table of hashtags across all accounts
We want to better understand 1) how many hashtags are used by the accounts we've followed and 2) how often these hashtags are used by each account
<a id="sec0"></a>

In [3]:
import json
import pandas as pd

We've already followed accounts of interest and extracted files with their tweets. Now, we extract content from these JSON files and append it to a list.

In [4]:
acctList = ['ClaudiaDides','Pa__tty','ValdebenitoNata','tv_monica','KarolCariola','gabrielboric','BeaSanchezYTu','LorenaPizarroS','GiorgioJackson',
           'camila_vallejo','Claudia_Mix','JorgeSharp','carmen_hertz','redolesoficial',
           'manugarpez','nanostern','jcoulon','labeasanchez','FelipeParadaM',
           'AlbertoMayol','PamJiles','DMatamala','CristobalYessen','danieljadue',
           'sarmiento510','IraciHassler','nataliacuevasg','mriesco','MauroMura11',
           'juan_urra','tomashirsch','monlaferte','emiliatijoux','MarianaLaActriz',
           'Jaime_Bassa','KenaLorenziniL','mirnaschindler','ale_injoque','mauricio_weibel','JParadaHoyl']
acctContent = []
for i in acctList:
    with open(i+"-timeline.json") as json_file:
        data = json.load(json_file)
        acctContent.append(data)

From each account's JSON content, we extract the hashtags used in that account's tweets

In [5]:
i = 0
acctTag= {}

#find unique hashtags in json file and count frequency
while i < len(acctContent):
    hashtags = {}
    j = 0
    while j < len(acctContent[i]):
        if acctContent[i][j]['entities']['hashtags'] != []:
            k = 0
            while k< len(acctContent[0][0]['entities']['hashtags']):
                if acctContent[i][j]['entities']['hashtags'][k]['text'] not in hashtags:
                    newTag = acctContent[i][j]['entities']['hashtags'][k]['text'].lower()
                    hashtags[newTag] = 1
                else:
                    hashtags[acctContent[i][j]['entities']['hashtags'][k]['text'].lower()] +=1
                k+=1  
        j+=1
    acctTag[acctList[i]] = hashtags
    i+=1

Having extracted the hashtags from each account's tweets, we structure a dataframe counting how many times all accounts use each hashtag.

In [12]:
df = pd.DataFrame(acctTag)
df = df.fillna(0)
df.head()

Unnamed: 0,ClaudiaDides,Pa__tty,ValdebenitoNata,tv_monica,KarolCariola,gabrielboric,BeaSanchezYTu,LorenaPizarroS,GiorgioJackson,camila_vallejo,...,tomashirsch,monlaferte,emiliatijoux,MarianaLaActriz,Jaime_Bassa,KenaLorenziniL,mirnaschindler,ale_injoque,mauricio_weibel,JParadaHoyl
08m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000días,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100añoscorvalán,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100añosenriquekirberg,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100añosvioleta,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We now convert our dataframe to a CSV.

In [7]:
df.to_csv ('hashtags.csv', header=True)

## Part 2: Find top 10 hashtags across all accounts
<a id="sec1"></a>

Now that we have all the hashtags and all the accounts, we can find what the top 10 hashtags across all these accounts are. The following code makes a counter-esque dictionary where the keys are each hashtag and the values are the count for how often that tag comes up summed across all accounts.

In [56]:
sumcol = [[],[]]
for i in range(df.shape[0]):
    sumcol[0].append(df.iloc[i].sum()) #append sum of each column (hashtag) first list in a list of lists
tags = list(df.index) #extract actual text of all hashtags
sumcol[1] = tags
freqdict = {tags[i]: sumcol[0][i] for i in range(len(tags))}  #create dictionary mapping hashtags to frequency
tagfreq = sorted(freqdict.items(), key=lambda x: x[1], reverse=True) #sort by most popular

In [66]:
dfsorted = pd.DataFrame(tagfreq, columns =['hashtag', 'count']) #create dataframe of hashtags and counts

In [85]:
dfsorted[0:10]

Unnamed: 0,Hashtag,Count
0,40horas,56.0
1,fb,44.0
2,chile,34.0
3,enacional,30.0
4,noestamosenguerra,30.0
5,aborto3causales,29.0
6,cambiodegabinete,27.0
7,asambleaconstituyente,25.0
8,nuevaconstitucion,25.0
9,toquedequeda,25.0


## Part 3: Find top 10 hashtags for a specific account
Given a specific account (referred to by its index in the list of accounts), we can find its top 10 hashtags with the tagCounter function.
<a id="sec2"></a>

In [126]:
def tagCounter(i):
    minidf = df.iloc[:,[i]] #make mini dataframe for how often specified account uses all hashtags 
    minidf.sort_values(by=acctList[i], inplace=True, ascending=False) #sort the mini df
    newdf = minidf[0:10].reset_index() #make a new dataframe of the top 10 hashtags
    newdf.rename(columns = {'index':'hashtag',acctList[i]:'count'}, inplace = True)
    return acctList[i],newdf

In [130]:
print("Top 10 hashtags for @"+tagCounter(0)[0]+" are:")
tagCounter(0)[1]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Top 10 hashtags for @ClaudiaDides are:


Unnamed: 0,hashtag,count
0,fridaysforfuture,3.0
1,toquedequeda,3.0
2,40horas,3.0
3,11deseptiembre,2.0
4,climatestrike,2.0
5,cop25chile,2.0
6,agua,2.0
7,rodeo,2.0
8,11septiembre,2.0
9,agujerodeozono,1.0


For instance, in this example, we see that @ClaudiaDides talks most about the curfew (#toquedequeda) in Chile, key dates like the 9/11/73 coup de etat there, and climate issues.