# European Football Clubs Google Keywords Rankings

The most popular European football clubs are among the most searched keywords in many places. This is a quick overview of the domains that rank for those clubs' keywords. 

## Methodology

- **Club selection:** I got the top clubs from the Wikipedia list, containing all [clubs that won at least one UEFA championship](https://en.wikipedia.org/wiki/List_of_UEFA_club_competition_winners)
- **Keyword selection:** Every club name was appended with the word "football" to make it explicit that the query is about the club and not the city (where applicable). I did the same for seven languages (chosen based on the top seven countries whose clubs won the most championships). As a result, each club was queried seven times, once per language.  
Example:
'real madrid football', 'real madrid fútbol', 'real madrid fußball', 'milan football', 'milan fútbol', 'milan fußball', etc.

- **Resulting data set:**  
clubs: 79  
languages: 7  
queries: 79 x 7 = 553  
results: 10 x 553 = 5,530  
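
As a rough illustration of how the keyword list and the counts above come about, here is a minimal sketch in plain Python. The three-club list is a hypothetical sample (the real list has 79 clubs), and `queries` is a name I made up for this example.

```python
from itertools import product

# Hypothetical three-club sample; the full data set uses 79 club names.
sample_clubs = ['real madrid', 'milan', 'liverpool']

# The word for "football" in each of the seven interface languages.
football_words = {'en': 'football', 'fr': 'football', 'de': 'fußball',
                  'es': 'fútbol', 'it': 'calcio', 'pt-BR': 'futebol',
                  'nl': 'voetbal'}

# One (language, keyword) pair per club per language.
queries = [(lang, f'{club} {word}')
           for (lang, word), club in product(football_words.items(), sample_clubs)]

num_queries = len(queries)        # clubs x languages
num_results = num_queries * 10    # ten results per query
print(num_queries, num_results)
```

With 79 clubs instead of three, the same arithmetic gives the 553 queries and 5,530 results listed above.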

### Packages and versions

In [None]:
import advertools as adv
import pandas as pd
pd.options.display.max_columns = None
from plotly.tools import make_subplots
import plotly.graph_objs as go
import plotly
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()

print('Package         Version')
print('=' * 25)
for package in [plotly, pd, adv]:
    print(f'{package.__name__:<15}', ': ', package.__version__, sep='')

### Generating the data
The following code was used to generate the dataset.  
First we get two tables from the Wikipedia article and save them as CSV files: 

In [None]:
# page = 'https://en.wikipedia.org/wiki/List_of_UEFA_club_competition_winners'
# column_key = pd.read_html(page)[0]
# column_key = column_key.rename(columns={0: 'abbreviation', 1: 'tournament'})
# column_key.to_csv('column_key.csv', index=False)
# clubs = pd.read_html(page)[1]
# clubs.to_csv('clubs.csv', index=False)

`column_key` is a table that simply lists the abbreviations in the bigger table and their expansions.

In [None]:
column_key = pd.read_csv('../input/column_key.csv')
column_key

`clubs` is the DataFrame that we will be working with, and below is a sample.

In [None]:
clubs = pd.read_csv('../input/clubs.csv')
clubs.head(10)

A quick exploration of the data set. 

In [None]:
top_countries = (clubs
                 .groupby('Country')
                 .agg({'Total': 'sum'})
                 .sort_values('Total', ascending=False)
                 .reset_index()
                 .head(10))
top_countries

In [None]:
(clubs
 .groupby(['Country'])
 .agg({'Club': 'count', 'Total': 'sum'})
 .sort_values('Club', ascending=False)
 .reset_index()
 .head(9)
 .set_axis(['country', 'num_clubs', 'total_wins'], axis=1)  # returns a new DataFrame by default (pandas >= 1.0)
 .assign(wins_per_club=lambda df: df['total_wins'].div(df['num_clubs']))
 .style.background_gradient(high=0.2))



* More English clubs won tournaments than clubs from any other country, while Spanish clubs won more tournaments per club (and more tournaments overall). 

The names of the clubs will be used to run the requests to Google.

In [None]:
clubs_list = clubs['Club'].str.lower().tolist()
clubs_list[:10]

`lang_football` is a simple dictionary listing the seven languages, and the word 'football' in each language. 

In [None]:
lang_football = {'en': 'football',
                 'fr': 'football',
                 'de': 'fußball',
                 'es': 'fútbol',
                 'it': 'calcio',
                 'pt-BR': 'futebol',
                 'nl': 'voetbal'}
lang_football

The code in the following cell generates the data set. There is some setup that needs to be done if you want to run the code yourself.

1. [Create a custom search engine.](https://cse.google.com/cse/) At first, you might be asked to enter a site to search. Enter any domain, then go to the control panel and remove it. Make sure you enable "Search the entire web" and image search. You will also need to get your search engine ID, which you can find on the control panel page.
2. [Enable the custom search API.](https://console.cloud.google.com/apis/library/customsearch.googleapis.com) The service will allow you to retrieve and display search results from your custom search engine programmatically. You will need to create a project for this first.
3. [Create credentials for this project](https://console.developers.google.com/apis/api/customsearch.googleapis.com/credentials) so you can get your key.
4. [Enable billing for your project](https://console.cloud.google.com/billing/projects) if you want to run more than 100 queries per day. The first 100 queries are free; then for each additional 1,000 queries, you pay USD $5.
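
To get a feel for what a run like this one might cost, here is a back-of-the-envelope sketch based on the pricing in step 4. It assumes that all queries are run on the same day and that billing is proportional to the number of queries beyond the free quota; both are my assumptions, so check the current pricing before relying on it. `estimate_daily_cost` is a hypothetical helper, not part of any library.

```python
def estimate_daily_cost(num_queries, free_per_day=100, usd_per_1000=5.0):
    """Estimated USD cost of running `num_queries` in a single day.

    Assumes the first `free_per_day` queries are free and the rest are
    billed proportionally at `usd_per_1000` per 1,000 queries.
    """
    billable = max(num_queries - free_per_day, 0)
    return billable / 1000 * usd_per_1000


# 553 queries: 453 billable queries at USD 5 per 1,000.
print(round(estimate_daily_cost(553), 3))
```

Under these assumptions the 553 queries of this data set come to a little over two dollars.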

The [`advertools`](https://github.com/eliasdabbas/advertools) function `serp_goog` can take several possible parameters to customize the search query, and for this one we will be using two only: 

* `q`: The query we are searching for. Note that this can be a list of queries, and the looping is done for you, as in this case. 
* `hl`: The interface language (human-language). This tells Google to return results for a user who is using a computer/browser in this specific interface language. Like all other parameters of the function, `hl` can also be provided as a list of languages.

When running hundreds of queries I like to split them into a few chunks, just in case something goes wrong (a connection error, for example). But you can actually generate the whole data set with one function call. 

In [None]:
# cx = 'YOUR_CX_FROM_GOOGLE'
# key = 'YOUR_GOOGLE_DEV_KEY'

# serp_dfs = []
# for lang, q in lang_football.items():
#     temp_serp = adv.serp_goog(cx=cx, key=key, 
#                               hl=lang,
#                               q=[club + ' ' + q for club in clubs_list])
#     serp_dfs.append(temp_serp)

# serp_clubs = pd.concat(serp_dfs, sort=False)
# serp_clubs.to_csv('serp_clubs.csv', index=False)
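
The loop above chunks the work by language. If you prefer fixed-size chunks, something like the following sketch could be used. `chunked` is a hypothetical helper I wrote for this illustration, and the keyword list is made up; no request is sent to Google here.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical keyword list standing in for the real queries.
keywords = [f'club {i} football' for i in range(1, 11)]

completed = []
for chunk in chunked(keywords, 4):
    # In the real workflow each chunk would be passed to
    # adv.serp_goog(q=chunk, ...) and the result saved before moving on,
    # so a failure only costs you the current chunk.
    completed.append(chunk)

print([len(chunk) for chunk in completed])
```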

In [None]:
serp_clubs = pd.read_csv('../input/serp_clubs.csv', parse_dates=['queryTime'])
print(serp_clubs.shape)
serp_clubs.head()

I think it's a good idea to have the country of each club, and the club itself, as separate columns so we can group and analyze by country and club.  
We first create the `club_country` dictionary, which simply maps the clubs to their respective countries from our `clubs` DataFrame.  
Then we create a regular expression that removes all the 'football' words from the search terms, which leaves the club name. This way we can get the corresponding country for each extracted club. The same regular expression is used to extract the clubs. 

In [None]:
club_country = {club.lower(): country.lower() for club, country in zip(clubs['Club'], clubs['Country'])}
football_multi = '|'.join([' ' + football for football in lang_football.values()])

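# regex=True is needed in recent pandas versions, where str.replace
# treats the pattern as a literal string by default
serp_clubs['country'] = [club_country[club].title()
                         for club in serp_clubs['searchTerms'].str.replace(football_multi, '', regex=True)]
serp_clubs['club'] = serp_clubs['searchTerms'].str.replace(football_multi, '', regex=True).str.title()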
serp_clubs[['searchTerms', 'country', 'club']].sample(10)
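
To make the extraction step easier to follow in isolation, here is a small sketch that does the same thing on single strings with the standard library's `re.sub` instead of pandas. The two-club `club_country` dictionary and the `split_search_term` helper are hypothetical and only serve the illustration.

```python
import re

football_words = ['football', 'football', 'fußball', 'fútbol',
                  'calcio', 'futebol', 'voetbal']
# Each word preceded by a space, joined with "|" into one alternation.
football_multi = '|'.join(' ' + word for word in football_words)

# Hypothetical two-club lookup; the notebook builds this from `clubs`.
club_country = {'real madrid': 'spain', 'milan': 'italy'}


def split_search_term(search_term):
    """Return (club, country) for a '<club> <football word>' search term."""
    club = re.sub(football_multi, '', search_term)
    return club.title(), club_country[club].title()


print(split_search_term('real madrid fútbol'))
print(split_search_term('milan calcio'))
```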

## Top domains

In [None]:
print('unique domains:', serp_clubs['displayLink'].nunique())
print('number of results:', len(serp_clubs))
serp_clubs['displayLink'].value_counts().reset_index()[:10]

As you can see, the top domains ranking for these keywords are dominated by Wikipedia. This is not surprising, because the keywords are quite generic. Also, these are the top domains for the whole data set. It would be better to check the same for each language, country, or club, to get a more meaningful summary. 

#### Top domains for Barcelona:

In [None]:
serp_clubs[serp_clubs['club']=='Barcelona']['displayLink'].value_counts().reset_index()[:10]

#### Top domains for German keywords:

In [None]:
serp_clubs[serp_clubs['hl']=='de']['displayLink'].value_counts().reset_index()[:10]

#### Top domains for Italian clubs:

In [None]:
serp_clubs[serp_clubs['country']=='Italy']['displayLink'].value_counts().reset_index()[:10]

Either Italian sites need to focus on their SEO or Italian teams are extremely popular in other languages! 
The above can be run for any other parameter, or combination of parameters as well.  

I think it's also good to see if there are certain URLs that are dominant. The `link` column shows the actual landing page that the user will be directed to. 

In [None]:
serp_clubs['link'].value_counts().reset_index()[:10]

It seems seven is the highest number of appearances on SERPs for any particular URL. So we don't have any dominant landing pages, as we do with domains.  


## Top-level domains (TLDs)

Since we are researching clubs that belong to national leagues, and we are also searching in different languages, it might be interesting as well to check for the most used TLDs. Are they mostly .com or is there a big percentage that is on a local domain?

[`advertools`](https://github.com/eliasdabbas/advertools) provides the `extract_urls` function among other `extract_` functions that help in getting data on hashtags, mentions, URLs, and more. In our case we would be interested in the `top_tlds` key in the resulting dictionary:

In [None]:
adv.extract_urls(serp_clubs['link'])['top_tlds'][:10]

We can also expand this to see totals, percentages, cumulative sums, and cumulative percentages for each of the TLDs in our data set: 

In [None]:
# Extract the TLDs once and reuse the result for both columns
top_tlds = adv.extract_urls(serp_clubs['link'])['top_tlds']
(pd.DataFrame({
    'tld': [x[0] for x in top_tlds],
    'freq': [x[1] for x in top_tlds]
}).assign(percentage=lambda df: df['freq'].div(df['freq'].sum()),
          cumsum=lambda df: df['freq'].cumsum(), 
          cum_perc=lambda df: df['cumsum'].div(df['freq'].sum()))
 .head(15)
 .style.format({'percentage': '{:.2%}', 'cumsum': '{:,}', 'cum_perc': '{:.2%}'}))
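
If you want to see the idea behind this table without `advertools`, the sketch below counts TLDs with the standard library and prints the same five columns. It uses a naive definition of TLD (the last dot-separated label of the host name, so `bbc.co.uk` counts as `uk`), which is an assumption of mine and not necessarily how `extract_urls` works. The links are made up.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical landing pages standing in for serp_clubs['link'].
links = ['https://en.wikipedia.org/wiki/Real_Madrid_CF',
         'https://www.realmadrid.com/en',
         'https://www.bbc.co.uk/sport/football',
         'https://www.acmilan.com/it',
         'https://www.kicker.de/']

# Naive TLD: the last dot-separated label of the host name.
tld_counts = Counter(urlsplit(link).hostname.rsplit('.', 1)[-1] for link in links)

total = sum(tld_counts.values())
running = 0
# Columns: tld, freq, percentage, cumsum, cum_perc
for tld, freq in tld_counts.most_common():
    running += freq
    print(f'{tld:<5} {freq:>3} {freq / total:>7.2%} {running:>4} {running / total:>8.2%}')
```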

## Word frequency 
Checking the most commonly used words in a text list can help us in understanding what this list is about. A simple way is to use the `word_frequency` function from `advertools`.

Here I check the word counts in the titles of those pages: 

In [None]:
adv.word_frequency(serp_clubs['title'],
                   rm_words=adv.stopwords['english'].union(['-', '|', '  ', ''])).head(10)

Nothing surprising here. Mostly the generic words that you would expect to see.  
The same can be done by getting a subset of the data set, for example, these are the most used words in titles in Dutch: 

In [None]:
adv.word_frequency(serp_clubs[serp_clubs['hl']=='nl']['title'],
                   rm_words=adv.stopwords['english'].union(['-', '|', '  ', ''])).head(10)

Some websites don't expose their snippets to search engines, and it's good to see if we have a lot of those: 

In [None]:
serp_clubs['snippet'].isna().sum(), serp_clubs['title'].isna().sum()

In [None]:
serp_clubs[serp_clubs['snippet'].isna()]['displayLink'].value_counts()

Only three domains are doing this, for a total of thirty-one landing pages. Not a big issue.  
We can also check if there are any interesting words used in the snippets:

In [None]:
adv.word_frequency(serp_clubs['snippet'].fillna(''),
                   rm_words=adv.stopwords['english'].union(['-', '|', '  ','·', '', 'de'])).head(15)

Word counts in snippets in English:

In [None]:
adv.word_frequency(serp_clubs[serp_clubs['hl']=='en']['snippet'].fillna(''),
                   rm_words=adv.stopwords['english'].union(['-', '|', '  ','·', '', 'de'])).head(15)

Now it is a bit more specific, and you might want to dig deeper and see which domains focus on what kind of words: statistics, results, fixtures, etc.  
Counting words can also be in the form of short phrases. For example, here we count the 2-word phrases used in the snippets of all SERPs for Liverpool.  
We only have to specify the `phrase_len` parameter to any length we want. 

In [None]:
adv.word_frequency(serp_clubs[serp_clubs['club']=='Liverpool']['snippet'].fillna(''),
                   phrase_len=2,
                   rm_words=adv.stopwords['english'].union([ '|', '', 'de'])).head(20)
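
For readers curious about what counting words and phrases involves, here is a minimal sketch with `collections.Counter`. It is not the `advertools` implementation: `phrase_counts` is a hypothetical helper, it splits on whitespace only, and it drops any phrase that contains a removed word, which may differ from what `word_frequency` does. The snippets are made up.

```python
from collections import Counter


def phrase_counts(texts, phrase_len=1, rm_words=()):
    """Count phrases of `phrase_len` consecutive words across `texts`."""
    rm_words = set(rm_words)
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - phrase_len + 1):
            phrase = words[i:i + phrase_len]
            # Skip a phrase if any of its words is in the removal list.
            if not rm_words.intersection(phrase):
                counts[' '.join(phrase)] += 1
    return counts


# Hypothetical snippets, not taken from the data set.
snippets = ['Liverpool FC latest news and transfer news',
            'Latest news from Liverpool FC']

print(phrase_counts(snippets, phrase_len=2, rm_words=['and', 'from']).most_common(3))
```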

## Competitiveness: number of available results
It's also very important to know how competitive your keywords are. One measure is how many pages are eligible to appear for a particular keyword. More pages usually means more competition, but not necessarily: a small number of domains might be doing very high quality/aggressive SEO, which would make the keyword more competitive. The number is still a good measure, because a keyword that is worth competing for is usually a popular topic that many websites write about. 

#### Total results by keyword:

In [None]:
(serp_clubs
 .drop_duplicates(['searchTerms'])
 .groupby('searchTerms', as_index=False)
 .agg({'totalResults': 'sum'})
 .sort_values('totalResults', ascending=False)
 .reset_index(drop=True)
 [:10]
 .style.format({'totalResults': '{:,}'}))

#### Total results by club (across all languages):

In [None]:
(serp_clubs
 .drop_duplicates(['searchTerms'])
 .groupby('club', as_index=False)
 .agg({'totalResults': 'sum'})
 .sort_values('totalResults', ascending=False)
 .reset_index(drop=True)
 [:10]
 .style.format({'totalResults': '{:,}'}))

Good luck trying to rank for any of those keywords! 
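
A note on the `drop_duplicates` step in the two cells above: as that step suggests, every result row of a query carries the same `totalResults` value, so summing without removing duplicates would count each keyword once per result row. The sketch below shows the difference on a few made-up rows in plain Python.

```python
# Hypothetical rows: every row of the same query repeats the same
# totalResults value.
rows = [{'searchTerms': 'milan calcio', 'rank': 1, 'totalResults': 1_000},
        {'searchTerms': 'milan calcio', 'rank': 2, 'totalResults': 1_000},
        {'searchTerms': 'milan football', 'rank': 1, 'totalResults': 3_000},
        {'searchTerms': 'milan football', 'rank': 2, 'totalResults': 3_000}]

# Summing every row counts each keyword once per result row.
naive_total = sum(row['totalResults'] for row in rows)

# Keep one value per keyword, then sum: the equivalent of drop_duplicates.
per_keyword = {row['searchTerms']: row['totalResults'] for row in rows}
deduped_total = sum(per_keyword.values())

print(naive_total, deduped_total)
```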

## Top domains per language

In [None]:
fig = make_subplots(1, 7, print_grid=False, shared_yaxes=True)
for i, lang in enumerate(serp_clubs['hl'].unique()[:7]):
    df = serp_clubs[serp_clubs['hl']==lang]
    
    fig.append_trace(go.Bar(y=df['displayLink'].value_counts().values[:8], 
                            x=df['displayLink'].value_counts().index.str.replace('www.', '', regex=False)[:8],
                            name=lang,
                            orientation='v'), row=1, col=i+1)


fig.layout.margin = {'b': 150, 'r': 30}
fig.layout.legend.orientation = 'h'
fig.layout.legend.y = -0.5
fig.layout.legend.x = 0.15
fig.layout.title = 'Top Domains by Language of Search'
fig.layout.yaxis.title = 'Number of Appearances on SERPs'
fig.layout.plot_bgcolor = '#eeeeee'
fig.layout.paper_bgcolor = '#eeeeee'
iplot(fig)

In [None]:
fig = make_subplots(1, 7, shared_yaxes=True, print_grid=False)
for i, country in enumerate(serp_clubs['country'].unique()[:7]):
    if country in top_countries['Country'][:7].values:
        df = serp_clubs[serp_clubs['country']==country]

        fig.append_trace(go.Bar(y=df['displayLink'].value_counts().values[:8], 
                                x=df['displayLink'].value_counts().index.str.replace('www.', '', regex=False)[:8],
                                name=country,
                                orientation='v'), row=1, col=i+1)

fig.layout.margin = {'b': 150, 'r': 0}
fig.layout.legend.orientation = 'h'
fig.layout.legend.y = -0.5
fig.layout.legend.x = 0.15
fig.layout.title = 'Top Domains by Country of Club'
fig.layout.yaxis.title = 'Number of Appearances on SERPs'
fig.layout.plot_bgcolor = '#eeeeee'
fig.layout.paper_bgcolor = '#eeeeee'
iplot(fig)

In the last two charts, a higher number of appearances for the top domains shows that, for that language or country, rankings are more concentrated in a few domains. 
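
One possible way to put a number on this concentration is the share of all results taken by the top few domains. The sketch below is only an illustration of that idea: `top_share` is a hypothetical helper, and the two result lists are made up.

```python
from collections import Counter


def top_share(domains, n=8):
    """Fraction of all results that belong to the `n` most frequent domains."""
    counts = Counter(domains)
    return sum(freq for _, freq in counts.most_common(n)) / len(domains)


# Hypothetical result lists for two languages.
concentrated = ['a.com'] * 6 + ['b.com'] * 3 + ['c.com']
spread_out = ['a.com', 'b.com', 'c.com', 'd.com', 'e.com'] * 2

print(top_share(concentrated, n=2), top_share(spread_out, n=2))
```

A higher share means that fewer domains hold most of the rankings.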

## SERP summary/visualization

Finally, we can visually summarize the results by showing which domains appeared in each position, and how many times each. The following function is copied from a recipe I created to [visualize and summarize SERPs.](https://www.kaggle.com/eliasdabbas/coffee-and-cafe-search-engine-rankings-on-google)

In [None]:
def plot_serps(df, opacity=0.1, num_domains=10, width=None, height=700):
    """
    df: a DataFrame resulting from running advertools.serp_goog
    opacity: the opacity of the markers [0, 1]
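    num_domains: how many domains to plot
    width: the width of the figure in pixels (None lets plotly decide)
    height: the height of the figure in pixels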
    """
    top_domains = df['displayLink'].value_counts()[:num_domains].index.tolist()
    top_df = df[df['displayLink'].isin(top_domains)]
    top_df_counts_means = (top_df
                       .groupby('displayLink', as_index=False)
                       .agg({'rank': ['count', 'mean']})
                       # set_axis returns a new DataFrame by default (pandas >= 1.0)
                       .set_axis(['displayLink', 'rank_count', 'rank_mean'],
                                 axis=1))
    top_df = (pd.merge(top_df, top_df_counts_means)
          .sort_values(['rank_count', 'rank_mean'],
                       ascending=[False, True]))
    rank_counts = (top_df
               .groupby(['displayLink', 'rank'])
               .agg({'rank': ['count']})
               .reset_index()
               # set_axis returns a new DataFrame by default (pandas >= 1.0)
               .set_axis(['displayLink', 'rank', 'count'],
                         axis=1))
    num_queries = df['queryTime'].nunique()
    fig = go.Figure()
    fig.add_scatter(x=top_df['displayLink'].str.replace('www.', '', regex=False),
                    y=top_df['rank'], mode='markers',
                    marker={'size': 35, 'opacity': opacity},
                    showlegend=False)
    fig.layout.height = 600
    fig.layout.yaxis.autorange = 'reversed'
    fig.layout.yaxis.zeroline = False
    fig.add_scatter(x=rank_counts['displayLink'].str.replace('www.', '', regex=False),
                y=rank_counts['rank'], mode='text',
                marker={'color': '#000000'},
                text=rank_counts['count'], showlegend=False)
    for domain in rank_counts['displayLink'].unique():
        rank_counts_subset = rank_counts[rank_counts['displayLink']==domain]
        fig.add_scatter(x=[domain.replace('www.', '')],
                        y=[11], mode='text',
                        marker={'size': 50},
                        text=str(rank_counts_subset['count'].sum()))
        fig.add_scatter(x=[domain.replace('www.', '')],
                        y=[12], mode='text',
                        text=format(rank_counts_subset['count'].sum() / num_queries, '.1%'))
        fig.add_scatter(x=[domain.replace('www.', '')],
                        y=[13], mode='text',
                        marker={'size': 50},
                        text=str(round(rank_counts_subset['rank']
                                       .mul(rank_counts_subset['count'])
                                       .sum() / rank_counts_subset['count']
                                       .sum(),2)))
#     fig.layout.title = ('Google Search Results Rankings<br>keyword(s): ' + 
#                         ', '.join(list(df['searchTerms'].unique()[:5])) + 
#                         str(df['queryTime'].nunique()) + ' Football (Soccer) Queries')
    fig.layout.hovermode = False
    fig.layout.yaxis.autorange = 'reversed'
    fig.layout.yaxis.zeroline = False
    fig.layout.yaxis.tickvals = list(range(1, 14))
    fig.layout.yaxis.ticktext = list(range(1, 11)) + ['Total<br>appearances','Coverage', 'Avg. Pos.'] 
    fig.layout.height = height
    fig.layout.width = width
    fig.layout.yaxis.title = 'SERP Rank (number of appearances)'
    fig.layout.showlegend = False
    fig.layout.paper_bgcolor = '#eeeeee'
    fig.layout.plot_bgcolor = '#eeeeee'
    return fig

In [None]:
fig = plot_serps(serp_clubs, opacity=0.05)
fig.layout.title = 'SERPs for "<club_name> football" (79 clubs)'
iplot(fig)

In [None]:
fig = plot_serps(serp_clubs[serp_clubs['hl']=='es'], opacity=0.15)
fig.layout.title = 'SERPs for "<club_name> fútbol" in Spanish (79 clubs)'
iplot(fig)

In [None]:
fig = plot_serps(serp_clubs[serp_clubs['hl']=='en'], opacity=0.15)
fig.layout.title = 'SERPs for "<club_name> football" in English (79 clubs)'
iplot(fig)

In [None]:
fig = plot_serps(serp_clubs[serp_clubs['hl']=='de'], opacity=0.15)
fig.layout.title = 'SERPs for "<club_name> fußball" in German (79 clubs)'
iplot(fig)

In [None]:
fig = plot_serps(serp_clubs[serp_clubs['club']=='Liverpool'], opacity=0.15, num_domains=15)
fig.layout.title = 'SERPs for "liverpool football"'
iplot(fig)
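
The three rows at the bottom of each chart (total appearances, coverage, and average position) can be reproduced by hand. The following sketch computes them in plain Python for a few made-up rows. It follows the same definitions as `plot_serps`: coverage is appearances divided by the number of queries, so a domain that appears more than once per query can go above 100%. The `rows` and `summary` names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (query, domain, rank) rows standing in for a SERP DataFrame.
rows = [('q1', 'en.wikipedia.org', 1), ('q1', 'www.bbc.com', 2),
        ('q2', 'en.wikipedia.org', 2), ('q2', 'en.wikipedia.org', 5),
        ('q3', 'www.bbc.com', 4)]

num_queries = len({query for query, _, _ in rows})

# Collect the ranks at which each domain appeared.
ranks_by_domain = defaultdict(list)
for _, domain, rank in rows:
    ranks_by_domain[domain].append(rank)

summary = {domain: {'appearances': len(ranks),
                    'coverage': len(ranks) / num_queries,
                    'avg_position': sum(ranks) / len(ranks)}
           for domain, ranks in ranks_by_domain.items()}

for domain, stats in summary.items():
    print(domain, stats['appearances'],
          format(stats['coverage'], '.1%'), round(stats['avg_position'], 2))
```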

Many other options can be explored: 
* Try other, more specific keywords: `<club_name> tickets`, `<club_name> results`, `<club_name> transfers`, etc. 
* Summarize/visualize other combinations of languages, clubs, countries. 
* Try other search parameters like user geo-location, search type, etc.

## Further resources for getting, visualizing, and analyzing Google SERPs


* [A tutorial on how to use the `serp_goog` function](https://www.kaggle.com/eliasdabbas/search-engine-results-pages-serps-research) and how the different parameters work
* [Analyze flights and tickets SERPs](https://www.semrush.com/blog/analyzing-search-engine-results-pages/) (article on SEMrush)
* [Analyze Google and YouTube SERPs for the same keywords](https://www.kaggle.com/eliasdabbas/recipes-keywords-ranking-on-google-and-youtube)
* [Documentation of the `serp_goog` function](https://advertools.readthedocs.io/en/master/advertools.html#module-advertools.serp)
* [Text analysis for online marketing](https://www.semrush.com/blog/text-analysis-for-online-marketers) is a tutorial that explains the `word_frequency` function and what can be done to count words on an absolute and weighted basis. 