This notebook computes, for each department, the share of articles published under an author's full name and the share published under an abbreviation. These shares capture how likely a department is to publish under either form. The number of articles an author entity wrote in a department can then be scaled by the factor 50/abbreviation_share if the entity is an abbreviation, or by 50/full_name_share if it is a full name.

In [1]:
import json
import sqlite3

import pandas as pd

from src.models.MatchingType import MatchingType

In [2]:
con = sqlite3.connect('../data/interim/articles_with_author_mapping.db')
cur = con.cursor()

In [3]:
# get all articles whose affiliated authors are not organizations
query = """
    SELECT ar.id, ar.article_namespace_array, ar.published_at,
           a.name, a.abbreviation, a.matching_type
    FROM articles ar
    JOIN unmapped_article_authors aa ON ar.id = aa.article_id
    JOIN unmapped_authors a ON aa.author_id = a.id
    WHERE a.matching_type != ?
"""
rows = cur.execute(query, (MatchingType.ORGANIZATION_MATCH.name,)).fetchall()

In [56]:
departments = pd.DataFrame(columns=['id', 'department', 'published_at', 'name', 'abbreviation', 'matching_type'], data=rows)

In [57]:
departments["department"] = departments["department"].apply(lambda x: json.loads(x))
departments = departments.explode('department')
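
The decode-and-explode step above can be illustrated on toy data. This is a minimal sketch; the two rows are invented for illustration and only assume that the department column holds a JSON-encoded list, as the json.loads call suggests.

```python
import json

import pandas as pd

# Toy rows (invented): the department column holds a JSON-encoded list of namespaces.
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "department": ['["Wissen", "Wurzen"]', '["Wissen"]'],
    }
)

# Decode each JSON string into a Python list ...
toy["department"] = toy["department"].apply(json.loads)
# ... then emit one row per list element, repeating the other columns.
toy = toy.explode("department").reset_index(drop=True)

print(toy)
```

An article that belongs to several departments therefore contributes one row to each of them.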

In [58]:
# treat fuzzy and direct matches as abbreviations
# (assumes the column holds the enum names as strings, as in the query above)
abbreviation_like = [MatchingType.FUZZY_MATCH.name, MatchingType.DIRECT_MATCH.name]
departments.loc[departments["matching_type"].isin(abbreviation_like), "matching_type"] = MatchingType.IS_ABBREVIATION.name

In [59]:
# keep only departments in which both matching types occur
departments = departments.groupby('department').filter(lambda x: all(match_type in x['matching_type'].values for match_type in ['IS_ABBREVIATION', 'IS_FULL_NAME']))
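
As a small illustration of this group-wise filter, the sketch below uses invented departments A and B. It uses a set comparison instead of the all(...) expression, which is an equivalent formulation rather than the notebook's own code.

```python
import pandas as pd

# Toy data (invented): department "A" has both matching types, "B" only one.
toy = pd.DataFrame(
    {
        "department": ["A", "A", "A", "B", "B"],
        "matching_type": [
            "IS_ABBREVIATION",
            "IS_FULL_NAME",
            "IS_FULL_NAME",
            "IS_FULL_NAME",
            "IS_FULL_NAME",
        ],
    }
)

required = {"IS_ABBREVIATION", "IS_FULL_NAME"}

# groupby(...).filter keeps all rows of a group when the callable returns True.
kept = toy.groupby("department").filter(
    lambda group: required.issubset(set(group["matching_type"]))
)

print(kept)
```

Departments with only one matching type are dropped entirely, so both shares computed later are strictly between 0 and 1.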

In [60]:
# count the rows per department and matching type
grouped_departments = departments.groupby(["department", "matching_type"]).size().reset_index(name='count').sort_values(['department', 'matching_type', 'count'], ascending=[True, True, False])


In [62]:
# Convert 'matching_type' column to categorical to ensure proper sorting
grouped_departments['matching_type'] = pd.Categorical(grouped_departments['matching_type'], categories=['IS_ABBREVIATION', 'IS_FULL_NAME'], ordered=True)

# Pivot the DataFrame to have separate columns for each matching type
pivoted_departments = grouped_departments.pivot(index='department', columns='matching_type', values='count').reset_index()

# Calculate shares
pivoted_departments['abbreviation_share'] = pivoted_departments['IS_ABBREVIATION'] / (pivoted_departments['IS_ABBREVIATION'] + pivoted_departments['IS_FULL_NAME'])
pivoted_departments['full_name_share'] = pivoted_departments['IS_FULL_NAME'] / (pivoted_departments['IS_ABBREVIATION'] + pivoted_departments['IS_FULL_NAME'])

# create new df with ['department', 'abbreviation_share', 'full_name_share'] and a normal index
departments_scaler_score = pivoted_departments[['department', 'abbreviation_share', 'full_name_share']].copy()
# reset to a default integer index (0, 1, 2, ...)
departments_scaler_score.index = range(len(departments_scaler_score))


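Note: the header of the departments_scaler_score df still shows matching_type. This is not an index name but the name of the columns axis, which pivot takes from its columns argument and which survives reset_index() and the column selection. TODO: clear it, e.g. with rename_axis(None, axis="columns").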
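
A minimal sketch of where the leftover label comes from and one way to remove it, using invented counts for two departments A and B. The names counts, pivoted and cleaned are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy counts (invented) in the same long format as grouped_departments.
counts = pd.DataFrame(
    {
        "department": ["A", "A", "B", "B"],
        "matching_type": [
            "IS_ABBREVIATION",
            "IS_FULL_NAME",
            "IS_ABBREVIATION",
            "IS_FULL_NAME",
        ],
        "count": [1, 3, 2, 2],
    }
)

pivoted = counts.pivot(
    index="department", columns="matching_type", values="count"
).reset_index()

# pivot names the columns axis after the column it spread out.
print(pivoted.columns.name)

# Clearing the axis name leaves the data and the column labels untouched.
cleaned = pivoted.rename_axis(None, axis="columns")
print(cleaned.columns.name)
```

Setting pivoted.columns.name = None in place would have the same effect.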


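The scaler for each department is then given by 50/share, using abbreviation_share if the author entity is an abbreviation and full_name_share if it is a full name. The entity's article count in a department is multiplied by the scaler of that department.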
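
The scaling rule can be sketched as follows. The two share rows are taken from the output of this notebook; the helper scaled_count and the article count of 10 are hypothetical and only serve as illustration. The constant 50 is taken from the text as-is. Since the shares are stored as fractions, an evenly split department gets a scaler of 50/0.5 = 100.

```python
import pandas as pd

# Two rows from departments_scaler_score (shares are fractions between 0 and 1).
shares = pd.DataFrame(
    {
        "department": ["Wissen", "Zoo-Leipzig"],
        "abbreviation_share": [0.500000, 0.915789],
        "full_name_share": [0.500000, 0.084211],
    }
).set_index("department")


def scaled_count(department, is_abbreviation, article_count):
    """Hypothetical helper: scale an entity's article count in one department."""
    column = "abbreviation_share" if is_abbreviation else "full_name_share"
    share = shares.loc[department, column]
    # Scaler as described in the text: 50 / share.
    return article_count * 50 / share


# An abbreviation entity with 10 articles in "Wissen" (article count invented).
print(scaled_count("Wissen", is_abbreviation=True, article_count=10))
```

A department that rarely publishes under a given form has a small share for it, so counts observed under that form are scaled up more strongly.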

In [64]:
departments_scaler_score

matching_type,department,abbreviation_share,full_name_share
0,1.-FC-Lok,0.539062,0.460938
1,30-Jahre-Friedliche-Revolution,0.100000,0.900000
2,7-Seen-Wanderung,0.666667,0.333333
3,Abo,0.147059,0.852941
4,Achtung-Baustelle,0.465517,0.534483
...,...,...,...
140,Wirtschaft-Regional,0.245983,0.754017
141,Wissen,0.500000,0.500000
142,Wurzen,0.084559,0.915441
143,Zoo-Leipzig,0.915789,0.084211
