# Analysis of Solving Topics from Google Search Console Data

The goal of the [SymPy 2022 Season of Docs project](https://github.com/sympy/sympy/wiki/Season-of-Docs-2022-Organization-Application) is to create documentation so that users can easily learn how to solve their mathematical problems.

Analysis of the Google search terms (below) reveals that searchers conceive of "solving" more broadly than SymPy's various solving functions; for example, searchers may want to "solve" an integral. So to best meet users' needs, this project will include any type of highly-sought "solving".

The new *solving* main page will thus help direct users to the type of solving they need, whether it be one of SymPy’s solvers or “solving” an integral, etc. The sub-pages will contain step-by-step guides for how to solve each type of mathematical problem.

We will prioritize the types based on interest. Because the SymPy community has opted not to have direct analytics installed on its documentation site due to user privacy concerns, we cannot track which page a user goes to next after a search leads them to a page on our site. Thus, we use more indirect analytics like Google Search Console. We extracted one year's worth of that data ending on April 15, 2022. Each search query, for example "sympy solve", is aggregated across the year with the following data:

- Top queries: search term
- Clicks: the number of times that a searcher clicked on a link to the docs.sympy.org site
- Impressions: the number of times a docs.sympy.org link was presented to a searcher
- CTR: the click-through rate, that is, the percentage of the time a searcher clicked on the docs.sympy.org link that the search engine presented
- Position: where the docs.sympy.org link appeared in search results (1 = first search result, 2 = second search result, etc.)
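
As a quick check of how CTR relates to the other two columns, the sketch below recomputes it from clicks and impressions. The helper name `click_through_rate` is hypothetical (it is not part of the notebook), and the figures are those listed for the query "sympy solve" in the per-query table of this document.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return the click-through rate as a percentage, rounded to two decimals."""
    if impressions == 0:
        # No impressions means no opportunity to click; report 0 rather than divide by zero
        return 0.0
    return round(100 * clicks / impressions, 2)


# Clicks and impressions for "sympy solve" in the Search Console export
ctr_sympy_solve = click_through_rate(12662, 20093)
print(ctr_sympy_solve)
```

The result should agree with the CTR that Search Console reports for that query, up to rounding.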

Note: SymPy has a function called [`solve`](https://docs.sympy.org/dev/modules/solvers/solvers.html?highlight=solve#sympy.solvers.solvers.solve) which is designed to find the roots of an equation or system of equations. SymPy `solve` may or may not be what users need for a particular problem, so this project's goal is to document commonly-requested types of "solving" regardless of the SymPy function best suited to the task.

In [6]:
import pandas
import numpy as np

In [7]:
# Read CSV data into pandas dataframe
df = pandas.read_csv('docs.sympy.org-Performance-on-Search-2022-04-15 - Queries.csv')

In [8]:
# Set up columns in dataframe
query, clicks, impressions, ctr, position, category, note = df.columns

# Data cleanup: Remove percentage sign % from ctr
df[ctr] = df[ctr].str.rstrip('%').astype(float)

## Search Queries Involving "solv" Categorized by Type of Solving

Because the focus of the Season of Docs 2022 project is improving SymPy's solving documentation (broadly conceived), I filtered the 1000 top queries down to those containing the stem "solv" (to capture "solve", "solver", "solving", etc.), which gave 127 queries. I categorized each query into one of 15 categories. The categories were designed to represent searchers' conception of the type of solving requested. The goal is to present a visitor to the new *solving* page on docs.sympy.org with a reasonable number of solving types, so the visitor can quickly find the type of solving they need and go to that sub-page. Any such categorization is subjective, and categories may overlap. The "general" category includes queries where the searcher's exact intent is unclear, for example "sympy solve" and "python solver".
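
The stem filter can be illustrated on a few queries. This is a minimal sketch: the `queries` series is a hypothetical sample rather than the real export, and `na=False` is an added assumption so that any missing query counts as a non-match.

```python
import pandas

# Hypothetical sample of search queries (not the real Search Console export)
queries = pandas.Series([
    "sympy solve",
    "python solver",
    "sympy integrate",
    "solving equations in python",
    "sympy matrix",
])

# Keep only queries containing the stem "solv"; na=False treats missing values as non-matches
has_solv = queries.str.contains("solv", na=False)
solv_queries = queries[has_solv].tolist()
print(solv_queries)
```

Matching on the stem rather than on the whole word "solve" is what lets "solver" and "solving" through with a single pattern.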

For each category, I summed the clicks and impressions, and took the mean of the click-through rate (ctr) and position. To prioritize the categories, I sorted by impressions. I chose impressions rather than clicks because impressions count the number of times the search engine deemed docs.sympy.org pages relevant. Clicks, by contrast, also reflect whether the searcher chose the docs.sympy.org link, which is affected by the page title, summary, etc.
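
Since the CSV export is not bundled with this write-up, here is a self-contained sketch of the same aggregation on the ten per-query rows listed in the table at the end of this document. The variable names `sample` and `summary` are illustrative. Because this is only a subset of the 127 queries, the totals are smaller than those in the full category table.

```python
import pandas

# Ten rows copied from the per-query table in this document (a subset, not the full export)
sample = pandas.DataFrame(
    [
        ("sympy solve", 12662, 20093, 63.02, 1.11, "general"),
        ("python solver", 3553, 8720, 40.75, 1.27, "general"),
        ("python solve equation", 2913, 11087, 26.27, 2.55, "equation"),
        ("sympy solve equation", 2368, 3543, 66.84, 1.0, "equation"),
        ("python equation solver", 2092, 5992, 34.91, 1.66, "general"),
        ("solve sympy", 1908, 2691, 70.9, 1.03, "general"),
        ("solve python", 1854, 5960, 31.11, 2.0, "general"),
        ("solve equation python", 1765, 5769, 30.59, 2.24, "equation"),
        ("sympy solve system of equations", 1466, 2514, 58.31, 1.01, "system of equations"),
        ("solver python", 1459, 3365, 43.36, 1.24, "general"),
    ],
    columns=["Top queries", "Clicks", "Impressions", "CTR", "Position", "Category"],
)

# Sum the counts, average the rates and positions, then rank categories by impressions
summary = (
    sample.groupby("Category")
    .agg(
        clicks_sum=("Clicks", "sum"),
        impressions_sum=("Impressions", "sum"),
        ctr_mean=("CTR", "mean"),
        position_mean=("Position", "mean"),
    )
    .round(2)
    .sort_values("impressions_sum", ascending=False)
)
print(summary)
```

Note that `ctr_mean` here is an unweighted mean over queries, so a low-traffic query influences it as much as a high-traffic one.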

In [9]:
# Group queries by category
groupby = df.groupby([category])

# Aggregate numerical columns, round them, then sort descending by impressions
groupby.agg(
    clicks_sum=(clicks, 'sum'),
    impressions_sum=(impressions, 'sum'),
    ctr_mean=(ctr, 'mean'),
    position_mean=(position, 'mean'),
).round(2).sort_values(["impressions_sum"], ascending=False)

| Category | clicks_sum | impressions_sum | ctr_mean | position_mean |
|---|---|---|---|---|
| general | 34813 | 73944 | 47.68 | 1.5 |
| equation | 11868 | 40430 | 33.4 | 2.57 |
| numerical | 3696 | 16844 | 43.12 | 1.93 |
| partial differential equation(s) | 1485 | 16215 | 21.89 | 4.24 |
| system of equations | 2814 | 8513 | 39.58 | 2.2 |
| ordinary differential equation(s) | 3751 | 6415 | 57.93 | 1.1 |
| system of linear equations | 2086 | 3437 | 58.08 | 1.1 |
| system of nonlinear equations | 1073 | 2224 | 47.44 | 1.15 |
| matrix | 1134 | 1918 | 61.83 | 1.06 |
| inequality | 596 | 995 | 59.46 | 1.02 |


The project will focus on the categories in the table above. Roughly following the order in that table, it will prioritize:
- the order of solving types on the new *solving* main page
- the creation of pages with step-by-step guides for each solving type.

## Individual Top Queries

For completeness, all 127 top queries containing "solv" are below, including their categorization.

In [10]:
# Ensure all rows will be output
pandas.set_option('display.max_rows', 1000)

# Wrap words onto another line--for Note column
pandas.set_option('display.max_colwidth', 0)

solv = df[query].str.contains("solv")
df[solv].fillna(value="")

| | Top queries | Clicks | Impressions | CTR | Position | Category | Note |
|---|---|---|---|---|---|---|---|
| 0 | sympy solve | 12662 | 20093 | 63.02 | 1.11 | general | |
| 14 | python solver | 3553 | 8720 | 40.75 | 1.27 | general | |
| 15 | python solve equation | 2913 | 11087 | 26.27 | 2.55 | equation | |
| 18 | sympy solve equation | 2368 | 3543 | 66.84 | 1.0 | equation | |
| 24 | python equation solver | 2092 | 5992 | 34.91 | 1.66 | general | |
| 31 | solve sympy | 1908 | 2691 | 70.9 | 1.03 | general | |
| 33 | solve python | 1854 | 5960 | 31.11 | 2.0 | general | |
| 36 | solve equation python | 1765 | 5769 | 30.59 | 2.24 | equation | |
| 45 | sympy solve system of equations | 1466 | 2514 | 58.31 | 1.01 | system of equations | |
| 46 | solver python | 1459 | 3365 | 43.36 | 1.24 | general | |
