# Analyze Changes in Traffic

### Initial Setup 
- This script uses the Google Search Console (GSC) API
- If you don't have GSC API credentials yet, watch this video from Jean Chouinard first: https://www.youtube.com/watch?v=-uy4L4P1Ujs&t=4s
- This code is built on June Tao Ching's GSC code, modified to audit my own content
    - Source: https://towardsdatascience.com/access-google-search-console-data-on-your-site-by-using-python-3c8c8079d6f8

### What does the code do?

- The following code pulls keyword-level and URL-level data from GSC for 2 different date ranges
- Then it joins the two date ranges and subtracts clicks, avg. position and impressions at the URL and keyword level
- Next, we look at the keywords, subfolders and URLs that saw the largest drop in traffic
- We then run an Ngram analysis on the keywords that saw the largest drop in traffic
- Bonus: If you want to dive a little deeper, you can run a Screaming Frog crawl to get titles, H1s and H2s for the URLs with the largest traffic drops, then run an Ngram analysis on the titles to find patterns among your traffic losers (scroll to the bottom of my code for this analysis)
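
As a rough sketch of the join-and-subtract step described above, here it is on toy data. The URLs and click counts are made up for illustration, and the `" Prev"` suffix is just one possible naming choice, not the one used in the notebook cells further down.

```python
import pandas as pd

# Toy data standing in for two GSC exports (values are made up for illustration).
current = pd.DataFrame({
    "Address": ["https://example.com/a/", "https://example.com/b/"],
    "URL Clicks": [80, 40],
})
previous = pd.DataFrame({
    "Address": ["https://example.com/a/", "https://example.com/c/"],
    "URL Clicks": [100, 25],
})

# An outer merge keeps URLs that appear in only one of the two periods.
merged = current.merge(previous, how="outer", on="Address", suffixes=("", " Prev"))
merged = merged.fillna(0)

# Negative values mean the URL lost clicks in the current period.
merged["URL Clicks Diff"] = merged["URL Clicks"] - merged["URL Clicks Prev"]
losers = merged[merged["URL Clicks Diff"] < 0].sort_values("URL Clicks Diff")
print(losers)
```

Note that a URL missing from the current period (`/c/` here) shows up as a loser because its missing clicks are filled with 0.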



### What is an Ngram Analysis?
- "An n-gram is a collection of n successive items in a text document that may include words, numbers, symbols, and punctuation." - https://www.mathworks.com/discovery/ngram.html
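
To make the definition concrete, here is a minimal sketch that splits a phrase into word-level n-grams. The helper name `ngrams` and the sample phrase are assumptions for illustration only; the notebook's own implementation lives in the `n_gram` class below.

```python
def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words in text."""
    words = text.split()
    # Slide a window of n words across the list; too-short texts yield [].
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("best running shoes for women", 2))
# ['best running', 'running shoes', 'shoes for', 'for women']
```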

### How can Ngrams be used for SEO?
- You can use Ngrams to analyze keywords and find patterns, topic clusters and internal linking ideas
- Ngrams can be used to analyze your page titles, URLs, keywords, H1s and anchor text (internal and external) to find specific patterns

#### Other Examples:
- Ngrams on GSC Keywords to find Topic Clusters / Internal Linking opportunities
- Ngrams on competitor Keyword data / titles / H-1s to find other keywords / topics to build out
- Ngrams on competitors' titles (from the Ahrefs "Best by links" report) to determine which kinds of pages generate the most backlinks
- Ngrams on GSC Keywords / page titles to analyze the impact of an algorithm update 
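
The examples above all come down to counting how often each n-gram appears across a list of keywords or titles. Here is a small sketch of that idea using bigrams; the keyword list is invented and `bigrams` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Invented keywords standing in for a GSC export.
keywords = [
    "best running shoes",
    "best running shoes for women",
    "how to clean running shoes",
]


def bigrams(text):
    """Return the two-word n-grams in text."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


# Count every bigram across all keywords; frequent ones hint at a shared topic.
counts = Counter(gram for kw in keywords for gram in bigrams(kw))
top = counts.most_common(2)
print(top)
```

Bigrams that recur across many keywords ("running shoes" in this toy list) are candidates for a topic cluster or an internal linking hub.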

In [None]:
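import pickle
from datetime import datetime, timedelta

import pandas as pd
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build


class gsc_api: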

    def __init__(self,website,start_date,end_date):

        self.website = website
        self.start_date = start_date
        self.end_date = end_date

    

    def url_level_data(self):

        SITE_URL = self.website

        OAUTH_SCOPE = ('https://www.googleapis.com/auth/webmasters.readonly', 'https://www.googleapis.com/auth/webmasters')

        # Redirect URI for installed apps
        REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
        
        
        # You must edit gsc_credentials and pickled_credentials to include YOUR username
        gsc_credentials = r'your credentials '
        
        # where your pickled credential will be stored

        pickled_credentials = r'c:\users\your_username\desktop\pickled_credential'


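        # Reuse cached credentials if they exist; otherwise run the OAuth flow and cache the result
        try:
            with open(pickled_credentials + ".pickle", "rb") as f:
                credentials = pickle.load(f)
        except (OSError, IOError):
            flow = InstalledAppFlow.from_client_secrets_file(gsc_credentials, scopes=OAUTH_SCOPE)
            # run_console() used the out-of-band flow and was removed from newer google-auth-oauthlib releases
            credentials = flow.run_local_server(port=0)
            with open(pickled_credentials + ".pickle", "wb") as f:
                pickle.dump(credentials, f)

        # Connect to the Search Console service using the credentials
        webmasters_service = build('webmasters', 'v3', credentials=credentials)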

        maxRows = 25000
        i = 0
        output_rows = []
        start_date = datetime.strptime(self.start_date, "%Y-%m-%d")
        end_date = datetime.strptime(self.end_date, "%Y-%m-%d")
        
        def date_range(start_date, end_date, delta=timedelta(days=1)):

            current_date = start_date
            while current_date <= end_date:
                yield current_date
                current_date += delta
        print('script start date:', start_date)

        for date in date_range(start_date, end_date):
            date = date.strftime("%Y-%m-%d")
            i = 0
            while True:

                request = {
                    'startDate' : date,
                    'endDate' : date,
                    'dimensions' : ["page"],
                    "searchType": "Web",
                    'rowLimit' : maxRows,
                    'startRow' : i * maxRows
                }

                response = webmasters_service.searchanalytics().query(siteUrl = SITE_URL, body=request).execute()
                if response is None:
                    break
                if 'rows' not in response:
                    break
                else:
                    for row in response['rows']:
                        page = row['keys'][0]
                        output_row = [page, row['clicks'], row['impressions'], row['position']]
                        output_rows.append(output_row)
                    i = i + 1
        print('script end date:', end_date)

        df = pd.DataFrame(output_rows, columns=['Address', 'URL Clicks', 'URL Impressions', 'URL Average Position'])
        df = df.groupby(['Address']).agg({'URL Clicks':'sum','URL Impressions':'sum','URL Average Position':'mean'}).reset_index()
        df['URL CTR'] = df['URL Clicks'] / df['URL Impressions'] 
        return df

    def gsc_kw(self):

        SITE_URL = self.website

        OAUTH_SCOPE = ('https://www.googleapis.com/auth/webmasters.readonly', 'https://www.googleapis.com/auth/webmasters')

        # Redirect URI for installed apps
        REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
        
        
        # You must edit gsc_credentials and pickled_credentials to include YOUR username
        gsc_credentials = r'your credentials '
        
        # where your pickled credential will be stored

        pickled_credentials = r'c:\users\your_username\desktop\pickled_credential'
        


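        # Reuse cached credentials if they exist; otherwise run the OAuth flow and cache the result
        try:
            with open(pickled_credentials + ".pickle", "rb") as f:
                credentials = pickle.load(f)
        except (OSError, IOError):
            flow = InstalledAppFlow.from_client_secrets_file(gsc_credentials, scopes=OAUTH_SCOPE)
            # run_console() used the out-of-band flow and was removed from newer google-auth-oauthlib releases
            credentials = flow.run_local_server(port=0)
            with open(pickled_credentials + ".pickle", "wb") as f:
                pickle.dump(credentials, f)

        # Connect to the Search Console service using the credentials
        webmasters_service = build('webmasters', 'v3', credentials=credentials)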

        maxRows = 25000
        i = 0
        output_rows = []
        start_date = datetime.strptime(self.start_date, "%Y-%m-%d")
        end_date = datetime.strptime(self.end_date, "%Y-%m-%d")
        
        def date_range(start_date, end_date, delta=timedelta(days=1)):

            current_date = start_date
            while current_date <= end_date:
                yield current_date
                current_date += delta
        print('script start date:', start_date)

        for date in date_range(start_date, end_date):
            date = date.strftime("%Y-%m-%d")
            i = 0
            while True:

                request = {
                    'startDate' : date,
                    'endDate' : date,
                    'dimensions' : ["page",'query'],
                    "searchType": "Web",
                    'rowLimit' : maxRows,
                    'startRow' : i * maxRows
                }

                response = webmasters_service.searchanalytics().query(siteUrl = SITE_URL, body=request).execute()
                if response is None:
                    break
                if 'rows' not in response:
                    break
                else:
                    for row in response['rows']:
                        page = row['keys'][0]
                        keyword = row['keys'][1]
                        output_row = [ page,keyword, row['clicks'], row['impressions'], row['ctr'], row['position']]
                        output_rows.append(output_row)
                    i = i + 1
        print('script end date:', end_date)

        df = pd.DataFrame(output_rows, columns=['Address','Main Keyword', 'KW Clicks', 'KW Impressions', 'KW CTR',  'KW Average Position'])
        df = df.groupby(['Address','Main Keyword']).agg({'KW Clicks':'sum','KW Impressions':'sum','KW Average Position':'mean'}).reset_index()
        df['KW CTR'] = df['KW Clicks'] / df['KW Impressions'] 
        return df

In [1]:
## ngram class
from collections import defaultdict

import pandas as pd

class n_gram:

    def __init__(self,data, column):

        self.data = data
        self.column = column
        
    def generate_N_grams(self, text, ngram=1):
        # Split on whitespace; no stopword removal is applied here
        words = text.split()
        # Zip n staggered copies of the word list to form sliding windows of n words
        windows = zip(*[words[i:] for i in range(ngram)])
        return [' '.join(window) for window in windows]
        
    def n_gram_function(self, s):
        data = self.data
        column = self.column
        # Count how often each n-gram of size s appears across the column
        gram = defaultdict(int)
        for text in data[column]:
            for word in self.generate_N_grams(text, s):
                gram[word] += 1
        # Most frequent n-grams first
        gram = pd.DataFrame(sorted(gram.items(), key=lambda x: x[1], reverse=True))
        return gram
    
    
    ## Returns Unigram 
    def unigram(self):
        unigram = self.n_gram_function(1)
        return unigram
    
    ## Returns Bigram 
    def bigram(self):
        bigram = self.n_gram_function(2)
        return bigram

    ## Returns Trigram 
    def trigram(self):
        trigram = self.n_gram_function(3)
        return trigram

    ## Returns Quadgram 
    def quadgram(self):
        quadgram = self.n_gram_function(4)
        return quadgram
    
    ## Returns fivegram 
    def quintgram(self):
        quintgram = self.n_gram_function(5)
        return quintgram

In [None]:
# the following code pulls data from GSC's API for the date ranges we want to compare

current_date = gsc_api('https://yourwebsite.com/','2022-08-01','2022-10-31')
prev_date = gsc_api('https://yourwebsite.com/','2022-05-01','2022-07-31')

# We are requesting URL and KW level data here 

url_level_data_current = current_date.url_level_data()
kw_level_data_current = current_date.gsc_kw()

url_level_data_prev = prev_date.url_level_data()
kw_level_data_prev = prev_date.gsc_kw()

In [None]:
'''this code merges data from the 2 date ranges we requested. 
Then it does some basic subtraction to see differences in Clicks, Impressions and Average Position at the URL level
as well as KW level changes '''


url_level_data_prev = url_level_data_prev.rename(columns = {'URL Clicks':'Prev URL Clicks','URL Impressions':'Prev URL Impressions','URL Average Position':'Prev URL Average Position','URL CTR':'Prev URL CTR'})

df = url_level_data_current.merge(url_level_data_prev, how = 'outer',  on = 'Address')
df = df.fillna(0)


# Subtract clicks, impressions and avg. position at the URL level
# (a positive position diff means the URL moved up in the rankings)

df['URL Clicks Diff'] = df['URL Clicks'] - df['Prev URL Clicks']
df['URL Impressions Diff'] = df['URL Impressions'] - df['Prev URL Impressions']
df['URL Average Position Diff'] =  df['Prev URL Average Position'] - df['URL Average Position']

# Keep URLs that lost at least one click; .copy() avoids pandas' SettingWithCopyWarning
traffic_drop = df[df['URL Clicks Diff'] <= -1].copy()
# For https://yourwebsite.com/blog/post/, index 3 of the '/'-split is 'blog'
traffic_drop['Subfolder'] = traffic_drop['Address'].str.split('/').str[3]

subfolder_drops = traffic_drop[['Subfolder','URL Clicks Diff','URL Impressions Diff','URL Average Position Diff']].groupby('Subfolder').agg(['sum','mean', 'count']).reset_index()

# Flatten the two-level column index, e.g. ('URL Clicks Diff', 'sum') -> 'URL Clicks Diff_sum'
subfolder_drops.columns = ['_'.join(col).rstrip('_') for col in subfolder_drops.columns]

subfolder_drops.sort_values(by = 'URL Clicks Diff_sum')

### Subfolder Changes
- The code below shows the subfolders that had the biggest drop in traffic
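
The notebook derives the subfolder by splitting the URL on `/`. As an alternative sketch, the same first path segment can be pulled out with the standard library's `urlparse`. The function name `first_subfolder` is an assumption for illustration and is not used elsewhere in this notebook.

```python
from urllib.parse import urlparse


def first_subfolder(url):
    """Return the first path segment of url, or '' if the URL has no path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else ""


print(first_subfolder("https://yourwebsite.com/blog/my-post/"))  # blog
print(first_subfolder("https://yourwebsite.com/"))               # (empty string)
```

Like the split-based approach, this treats a page that sits directly under the root (for example `/about`) as its own "subfolder", so small single-page groups are expected in the output.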

In [None]:
subfolder_drops.sort_values(by = 'URL Clicks Diff_sum')

### Analyze Keywords that saw a dip in Clicks and Avg Position 
- Criteria may need to be adjusted depending on the size of your site! 
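
One way to make the criteria easy to adjust is to wrap the filter in a small function with explicit thresholds. This is only a sketch: `keyword_drops`, `min_click_loss` and `min_position_loss` are hypothetical names, the sample rows are invented, and the column names are assumed to match the keyword DataFrame built in the next cell.

```python
import pandas as pd


def keyword_drops(kw_data, min_click_loss=1, min_position_loss=1):
    """Return keywords that lost at least the given clicks and positions."""
    mask = (
        (kw_data["KW Clicks Diff"] <= -min_click_loss)
        & (kw_data["KW Average Position Diff"] <= -min_position_loss)
    )
    # Largest click losses first.
    return kw_data[mask].sort_values("KW Clicks Diff")


# Invented rows for illustration.
sample = pd.DataFrame({
    "Main Keyword": ["a", "b", "c"],
    "KW Clicks Diff": [-50, -2, 10],
    "KW Average Position Diff": [-3.0, -1.5, 2.0],
})

# A larger site might only care about keywords that lost 10+ clicks.
big_losers = keyword_drops(sample, min_click_loss=10)
print(big_losers)
```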

In [None]:
# Merge keyword data and subtract to find drops in clicks, impressions and avg. position

kw_level_data_prev = kw_level_data_prev.rename(columns = {'KW Clicks':'Prev KW Clicks','KW Impressions':'Prev KW Impressions','KW Average Position':'Prev KW Average Position','KW CTR':'Prev KW CTR'})
kw_data = kw_level_data_current.merge(kw_level_data_prev, how = 'outer', on =['Address','Main Keyword'])

kw_data[kw_data.columns[2:]] = kw_data[kw_data.columns[2:]].fillna(0)

kw_data['KW Clicks Diff']= kw_data['KW Clicks'] - kw_data['Prev KW Clicks']
kw_data['KW Impressions Diff']= kw_data['KW Impressions'] - kw_data['Prev KW Impressions']
kw_data['KW Average Position Diff']= kw_data['Prev KW Average Position'] - kw_data['KW Average Position']
kw_data['KW CTR Diff']= kw_data['KW CTR'] - kw_data['Prev KW CTR']

In [None]:
# Keywords that lost at least one click and at least one average position, largest click losses first
kw_drops = kw_data[(kw_data['KW Clicks Diff'] <= -1) & (kw_data['KW Average Position Diff'] <= -1)].sort_values(by = 'KW Clicks Diff')

In [None]:
## Return 50 Keywords that saw the biggest drop in traffic 
kw_drops.head(50)

### Run an Ngram Analysis on your keyword data 

#### Questions to ask yourself:
- Are there certain keywords that are dropping in ranks & clicks? (Review / Best / How to / What is , etc.) 
- Are there topics on your site that are dropping in rank and clicks?

In [None]:
## run an n_gram analysis 
ngram_kw_data =  n_gram(kw_drops, column = 'Main Keyword')

# Bigram
bigram_kw_data = ngram_kw_data.bigram()

# Trigram
trigram_kw_data = ngram_kw_data.trigram()

# Quadgram
quadgram_kw_data = ngram_kw_data.quadgram()


### Run an Ngram Analysis on your URL Level data 
- The following code runs a Screaming Frog crawl of your website to return the title, H1 and H2s for each URL
- It then merges the crawl with the GSC URL-level data we pulled earlier
- Finally, we can run an Ngram analysis on our titles, H1s or H2s

#### Questions to ask yourself:
- Are there certain URLs / Page titles that are dropping in ranks & clicks? (Review / Best / How to / What is , etc.) 
- Are certain topics on your site being affected?

In [None]:
import os

import pandas as pd


class sf_crawl:

    def __init__(self,website, output_folder):

        self.website = website
        self.output_folder = output_folder
        
    def urls(self):
        website = self.website
        output_folder = self.output_folder
        # Raw string so the backslashes in the Windows path are not read as escape sequences
        sf_command = (
            r'cd "C:\Program Files (x86)\Screaming Frog SEO Spider" && '
            'ScreamingFrogSEOSpiderCli.exe --crawl {} --headless --output-folder {} --export-tabs "Internal:All"'
        ).format(website, output_folder)
        os.system(sf_command)
        # Return the crawl export so the caller can filter and merge it
        return pd.read_csv(os.path.join(output_folder, 'internal_all.csv'))

In [None]:
crawl = sf_crawl('https://yourwebsite.com/',output_folder = r'c:\users\your_output_folder')

sf_data = crawl.urls()

# filters out non-indexable URLs

sf_data = sf_data[sf_data['Indexability'] == 'Indexable'][['Address','Indexability','Title 1','H1-1','H2-1','H2-2']]

traffic_drop = traffic_drop.merge(sf_data, how = 'left', on ='Address')


In [None]:
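## run an n_gram analysis on URLs that had the largest traffic drop 
# URLs missing from the crawl have no title after the left merge, so drop them first
ngram_url_data =  n_gram(traffic_drop.dropna(subset = ['Title 1']), column = 'Title 1')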

# Bigram
bigram_url_data = ngram_url_data.bigram()

# Trigram
trigram_url_data = ngram_url_data.trigram()

# Quadgram
quadgram_url_data = ngram_url_data.quadgram()