### Installations & Imports

In [2]:
!pip install langdetect

You should consider upgrading via the '/Users/Jessie/Downloads/UZH/Thesis/zora_classifier/venv/bin/python -m pip install --upgrade pip' command.


In [None]:
from langdetect import detect

# Initial reading of file

Robin's original .csv file was actually semicolon-separated, which was a problem for Excel, so it was opened in Numbers, given column headers, and re-exported as .tsv.

In [3]:
with open("data/robin_data.tsv") as f:
    rows = []

    for line in f:
        line = line.rstrip("\n")
        row = line.split("\t")

        # Only those with abstracts
        if row[-1] != "" and detect(row[-1]) == "en":
            rows.append(row)

# Remove header row
rows.pop(0)

LangDetectException: No features in text.

## "\"" issue

Error: `LangDetectException: No features in text.` Let's wrap the `detect` call in a `try/except` block.

In [4]:
with open("data/robin_data.tsv") as f:
    rows = []

    for line in f:
        line = line.rstrip("\n")
        row = line.split("\t")

        # Only those with abstracts
        if row[-1] != "":
            try:
                language = detect(row[-1])
            except Exception:  # e.g. LangDetectException: No features in text.
                language = "error"
                print("This row throws an error:", row)
            rows.append(row)

# Remove header row
rows.pop(0)

This row throws an error: ['"']


['SDG', 'Author(s)', 'Faculty', 'Year', 'Citation', 'Citation 2', 'Abstract']

Apparently there's one row that's just a quote mark. I suspect it's a line/column break issue, so let's try getting the rows before and after it to find it in Numbers and manually assess.

In [5]:
rows.index(["\""])

459

In [10]:
rows[458:461]

[['SDG\xa015.00',
  '"Ozgul, Arpat"',
  'MNF',
  '2019',
  '"Life history responses of meerkats to seasonal changes in extreme environments. Science, 363(6427):631-635."',
  '"Life history responses of meerkats to seasonal changes in extreme environments. Science, 363(6427):631-635."',
  '"Species in extreme habitats increasingly face changes in seasonal climate, but the demographic mechanisms through which these changes affect population persistence remain unknown. We investigated how changes in seasonal rainfall and temperature influence vital rates and viability of an arid environment specialist, the Kalahari meerkat, through effects on body mass. We show that climate change–induced reduction in adult mass in the prebreeding season would decrease fecundity during the breeding season and increase extinction risk, particularly at low population densities. In contrast, a warmer nonbreeding season resulting in increased mass and survival would buffer negative effects of reduced rainfall

The one before is row 567 in Numbers and the one after is row 568, so the lone `"` is not a row of its own in the file. That means I'm not missing a data entry, so I won't debug it further and will just toss it from my list.
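A minimal sketch of "tossing it" without hard-coding the quote mark: drop any row whose last field has no letters at all. The helper name `has_letters` and the sample rows are made up for illustration, and it is only my assumption that letter-free text is what makes `detect` complain about having no features.

```python
def has_letters(text):
    """Return True if the string contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)

# Hypothetical stand-in for `rows`: one full row, the stray quote row, one empty abstract
sample_rows = [
    ["SDG 15.00", '"Ozgul, Arpat"', "MNF", "2019", '"Citation"', '"Citation"', '"Some abstract text."'],
    ['"'],
    ["SDG 1.00", '"Berndt, Christian"', "MNF", "2015", '"Citation"', '"Citation"', ""],
]

# Keep only rows whose last field could plausibly be language-detected
kept = [row for row in sample_rows if has_letters(row[-1])]
print(len(kept))
```

This also drops rows with an empty abstract, so it would replace the `row[-1] != ""` check rather than sit next to it.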

As a sanity check, though, I will look for any other potential line/col break issues.

In [13]:
buggy_rows = [i for i in rows if len(i) != 7]

print(buggy_rows)

len(buggy_rows)



172

172 buggy rows! Quickly scanning the output, they all (or at least mostly) seem to be just abstracts, probably those that have a newline in them.

Let's try a quick and dirty way of checking that it's not also a col break issue by checking the lengths of each buggy row.

In [15]:
buggy_row_lens = [len(i) for i in buggy_rows]

print(min(buggy_row_lens))
print(max(buggy_row_lens))

1
1


Yay. Indeed it does seem to be a newline issue. Now the question is where the rest of their corresponding data (tag, author, title, etc.) went in my list of rows.
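The min/max check works, but a tally of every row length would show the same thing in one go, and would also reveal any in-between lengths that a column-break problem would produce. A small sketch with made-up rows (`sample_rows` is a stand-in, not the real data):

```python
from collections import Counter

# Hypothetical stand-in for `rows`: two complete 7-column rows, two orphaned fragments
sample_rows = [
    ["SDG 1.00", "author", "MNF", "2015", "citation", "citation 2", "abstract"],
    ["orphaned abstract fragment"],
    ["SDG 2.00", "author", "MNF", "2018", "citation", "citation 2", "abstract start"],
    ["another orphaned fragment"],
]

# Map each row length to the number of rows that have it
length_counts = Counter(len(row) for row in sample_rows)
print(length_counts)
```

If only two lengths show up (full width and 1), it is a line-break problem and not a column-break one.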

I will grab one buggy abstract and find its title in Numbers and look for that in my list.

In [16]:
buggy_rows[0]

['Results While microbial richness was marginally affected, we found pronounced cropping effects on community composition, which were specific for the respective microbiomes. Soil bacterial communities were primarily structured by tillage, whereas soil fungal communities responded mainly to management type with additional effects by tillage. In roots, management type was also the driving factor for bacteria but not for fungi, which were generally determined by changes in tillage intensity. To quantify an “effect size” for microbiota manipulation, we found that about 10% of variation in microbial communities was explained by the tested cropping practices. Cropping sensitive microbes were taxonomically diverse, and they responded in guilds of taxa to the specific practices. These microbes also included frequent community members or members co-occurring with many other microbes in the community, suggesting that cropping practices may allow manipulation of influential community members.']

The title for this abstract is "Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14.", and it's row 21 in Numbers.

This is getting inefficient. I think I should add an ID to each row with an abstract in the original data so I can map faster. Once I fix this issue, I should convert my list of lists into a dict.
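For the later conversion to a dict, something along these lines might do. It is a sketch under assumptions: the column names are copied from the header row printed earlier, the rows already carry an ID in position 0, and the sample values are placeholders.

```python
# Column names as printed from the header row earlier in this notebook
HEADER = ["SDG", "Author(s)", "Faculty", "Year", "Citation", "Citation 2", "Abstract"]

# Hypothetical stand-in for `rows` after an ID has been prepended to each one
sample_rows = [
    [1, "SDG 1.00", "Berndt, Christian", "MNF", "2015", "Citation A", "Citation A", "Abstract A"],
    [2, "SDG 2.00", "van der Heijden, Marcel G A", "MNF", "2018", "Citation B", "Citation B", "Abstract B"],
    [3, "orphaned abstract fragment"],
]

# ID -> {column name: value}; rows with the wrong width are skipped, not guessed at
records = {
    row[0]: dict(zip(HEADER, row[1:]))
    for row in sample_rows
    if len(row) == len(HEADER) + 1
}
print(records[2]["Author(s)"])
```

Skipping the short rows only makes sense once the fragments have been merged back into their parent rows; otherwise their text is silently lost.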

In [41]:
with open("data/robin_data.tsv") as f:
    rows = []
    row_id = 0

    for line in f:
        line = line.rstrip("\n")
        row = line.split("\t")

        # Only those with abstracts
        if row[-1] != "" and row[-1] != "\"" and detect(row[-1]) == "en":
            row_id += 1
            row.insert(0, row_id)
            rows.append(row)

len(rows[0])

8

No idea where the header row went, but the IDs are correct. Now let's re-code the `buggy_rows` bit, since the lengths of the rows are longer now.

In [40]:
buggy_rows = [i for i in rows if len(i) != 8]

print(buggy_rows)

len(buggy_rows)

149

WAIT WHY IS IT DIFFERENT NOW

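Whatever. I have fewer buggy rows now, so I guess that's good. Looking back at the `try/except` cell, it never actually used `language` to filter, so it kept fragments in every language (plus the lone `"` row), whereas this version only keeps EN. That probably accounts for the difference. Will double-check later if I remember/feel like it/have time.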

In [27]:
buggy_rows[0]

[14,
 'Results While microbial richness was marginally affected, we found pronounced cropping effects on community composition, which were specific for the respective microbiomes. Soil bacterial communities were primarily structured by tillage, whereas soil fungal communities responded mainly to management type with additional effects by tillage. In roots, management type was also the driving factor for bacteria but not for fungi, which were generally determined by changes in tillage intensity. To quantify an “effect size” for microbiota manipulation, we found that about 10% of variation in microbial communities was explained by the tested cropping practices. Cropping sensitive microbes were taxonomically diverse, and they responded in guilds of taxa to the specific practices. These microbes also included frequent community members or members co-occurring with many other microbes in the community, suggesting that cropping practices may allow manipulation of influential community member

Okay, at least the first one is still the same. Maybe I should check max/min again.

In [28]:
buggy_row_lens = [len(i) for i in buggy_rows]

print(min(buggy_row_lens))
print(max(buggy_row_lens))

2
2


That's correct because each one has the ID prepended to it. Right, moving on. Have to find their corresponding data.

The title for the first abstract is "Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14.", and it's row 21 in Numbers.

First let's find out if it's anywhere in `rows` by de-nesting and searching it.

In [42]:
flat_list = []

for i in rows:
    for j in i:
        flat_list.append(j)

flat_list.count("Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14.")

0

Nope. Let's check two more.

In [31]:
buggy_rows[-1]

[524,
 'In future, value orientations should be given closer attention both in the preparation of soldiers for deployment and in therapy for psychic disorders associated with deployment."']

Corresponding title, 674 in Numbers: "Depressivität und Wert­orientierungen im Verlauf von militärischen Auslands­einsätzen. Trauma & Gewalt, 12(02):134-150."

Well now here's an interesting case. The abstract is in both DE and EN, but the title is DE so it must be a DE article for which they provided the EN abstract also. I guess I should `langdetect` the titles instead of the abstracts.

There are 547 rows when I `langdetect` the abstracts, but I can't do the titles until I get these buggy rows fixed, because the fragment rows have no title column to index into.

In [34]:
flat_list.count("Depressivität und Wert­orientierungen im Verlauf von militärischen Auslands­einsätzen. Trauma & Gewalt, 12(02):134-150.")

0

In [32]:
buggy_rows[10]

[52,
 'TRIAL REGISTRATION: This RCT was funded by the Swiss National Science Foundation (100019_169781/1) and was registered on 18/06/2018 at ClinicalTrials.gov : NCT03575559 ."']

Title, x in Numbers: "A cluster randomized controlled trial comparing the effectiveness of an individual planning intervention with collaborative planning in adolescent friendship dyads to enhance physical activity (TWOgether). BMC Public Health, 18(1):911."

In [33]:
flat_list.count("A cluster randomized controlled trial comparing the effectiveness of an individual planning intervention with collaborative planning in adolescent friendship dyads to enhance physical activity (TWOgether). BMC Public Health, 18(1):911.")

0

Right, so it appears the titles are gone. Now to figure out what happened.

As a sanity check, let's do the same for a non-buggy row.

In [44]:
flat_list.count("Behavioural economics, experimentalism and the marketization of development. Economy and Society, 44(4):567-591.")

0

In [43]:
flat_list[:10]

[1,
 'SDG\xa01.00',
 '"Berndt, Christian"',
 'MNF',
 '2015',
 '"Behavioural economics, experimentalism and the marketization of development. Economy and Society, 44(4):567-591."',
 '"Behavioural economics, experimentalism and the marketization of development. Economy and Society, 44(4):567-591."',
 '"Using market-based pro-poor development policy in the global South as an example, this paper engages with the rise of behaviourism and experimentalism as a challenge to the neoclassical orthodoxy and the more recent transformation into an influential policy script. After charting the rise of behavioural economics and discussing the key conceptual building blocks of the emerging behavioural mainstream in economics, the paper turns to the marketization of anti-poverty policy in the global South. Based on an analysis of policy documents, project reports and academic interventions, it is argued that the behavioural approach to poverty shifts the focus from the market to the market subject and 

I should have put extra quote marks???

In [45]:
flat_list.count('"Behavioural economics, experimentalism and the marketization of development. Economy and Society, 44(4):567-591."')

2

Jesus Christ. Thank god for sanity checks. Leeeeet's try again with the buggy rows.

In [47]:
flat_list.count('"Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14."')

2

Right, okay. Looks like they're in there.
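To avoid tripping over the wrapping quote marks again, the lookup could ignore them on both sides of the comparison. A rough sketch; `count_cell` is a made-up helper and the sample title is a placeholder, not a real citation:

```python
from itertools import chain

def count_cell(rows, target):
    """Count cells equal to `target`, ignoring wrapping double quotes."""
    wanted = target.strip('"')
    return sum(
        1
        for cell in chain.from_iterable(rows)  # de-nest the list of lists lazily
        if isinstance(cell, str) and cell.strip('"') == wanted  # skip the int IDs
    )

# Hypothetical stand-in for `rows`: citation cells are wrapped in literal quote marks
sample_rows = [
    [1, "SDG 1.00", '"Author, A"', "MNF", "2015",
     '"Example title. Journal, 1(1):1-2."',
     '"Example title. Journal, 1(1):1-2."',
     '"Some abstract."'],
    [2, "floating fragment"],
]

hits = count_cell(sample_rows, "Example title. Journal, 1(1):1-2.")
print(hits)
```

One caveat: `.strip('"')` also removes a quote mark that genuinely belongs at the start or end of a cell, which is fine for a lookup but not for cleaning the data.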

In [48]:
flat_list.index('"Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14."')

109

In [49]:
flat_list[100:120]

['2018',
 '"A global meta-analysis of yield stability in organic and conservation agriculture. Nature Communications, 9:3632."',
 '"A global meta-analysis of yield stability in organic and conservation agriculture. Nature Communications, 9:3632."',
 '"One of the primary challenges of our time is to enhance global food production and security. Most assessments in agricultural systems focus on plant yield. Yet, these analyses neglect temporal yield stability, or the variability and reliability of production across years. Here we perform a meta-analysis to assess temporal yield stability of three major cropping systems: organic agriculture and conservation agriculture (no-tillage) vs. conventional agriculture, comparing 193 studies based on 2896 comparisons. Organic agriculture has, per unit yield, a significantly lower temporal stability (−15%) compared to conventional agriculture. Thus, although organic farming promotes biodiversity and is generally more environmentally friendly, future

So I guess I have to do something about the newlines when I read in the file.

In [53]:
with open("data/robin_data.tsv") as f:
    rows = []
    row_id = 0

    for line in f:
        line = line.rstrip("\n")
        row = line.split("\t")

        # Only those with abstracts
        if row[-1] != "" and row[-1] != "\"" and detect(row[-1]) == "en":
            if "\n" in row[-1]:
                row[-1].replace("\n", " ")
            row_id += 1
            row.insert(0, row_id)
            rows.append(row)

len(rows)

547

Hmm. There were 547 rows before I did this -_-

Commence tiny dumb manual experiment.

In [51]:
buggy_rows[0]

[15,
 'Results While microbial richness was marginally affected, we found pronounced cropping effects on community composition, which were specific for the respective microbiomes. Soil bacterial communities were primarily structured by tillage, whereas soil fungal communities responded mainly to management type with additional effects by tillage. In roots, management type was also the driving factor for bacteria but not for fungi, which were generally determined by changes in tillage intensity. To quantify an “effect size” for microbiota manipulation, we found that about 10% of variation in microbial communities was explained by the tested cropping practices. Cropping sensitive microbes were taxonomically diverse, and they responded in guilds of taxa to the specific practices. These microbes also included frequent community members or members co-occurring with many other microbes in the community, suggesting that cropping practices may allow manipulation of influential community member

In [52]:
rows[13]

[14,
 'SDG\xa02.00',
 '"van der Heijden, Marcel G A"',
 'MNF',
 '2018',
 '"Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14."',
 '"Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14."',
 '"Background Harnessing beneficial microbes presents a promising strategy to optimize plant growth and agricultural sustainability. Little is known to which extent and how specifically soil and plant microbiomes can be manipulated through different cropping practices. Here, we investigated soil and wheat root microbial communities in a cropping system experiment consisting of conventional and organic managements, both with different tillage intensities.']

What happens is that the first part of the abstract before the newline gets saved with the right data, and the subsequent parts are cut off and end up floating on their own.

What if I try replacing the newlines before tab-splitting?

Also, `.replace()` returns a new string instead of changing the original in place. I forgot to save the result.

In [54]:
with open("data/robin_data.tsv") as f:
    rows = []
    row_id = 0

    for line in f:
        line = line.rstrip("\n")
        if "\n" in line:
            line = line.replace("\n", " ")
        row = line.split("\t")

        # Only those with abstracts
        if row[-1] != "" and row[-1] != "\"" and detect(row[-1]) == "en":
            row_id += 1
            row.insert(0, row_id)
            rows.append(row)

len(rows)

547

What the actual fuck. Let's check that the if-condition is even being entered.

In [55]:
with open("data/robin_data.tsv") as f:
    rows = []
    row_id = 0

    for line in f:
        line = line.rstrip("\n")
        if "\n" in line:
            print("something happened")
            line = line.replace("\n", " ")
        row = line.split("\t")

        # Only those with abstracts
        if row[-1] != "" and row[-1] != "\"" and detect(row[-1]) == "en":
            row_id += 1
            row.insert(0, row_id)
            rows.append(row)

len(rows)

547

Let's try taking it out of the if-condition entirely; `.replace()` should just do nothing if there's no newline in there anyway.

In [56]:
with open("data/robin_data.tsv") as f:
    rows = []
    row_id = 0

    for line in f:
        line = line.rstrip("\n")
        line = line.replace("\n", " ")
        row = line.split("\t")

        # Only those with abstracts
        if row[-1] != "" and row[-1] != "\"" and detect(row[-1]) == "en":
            row_id += 1
            row.insert(0, row_id)
            rows.append(row)

len(rows)

547

UGH

In [59]:
rows[13:15]

[[14,
  'SDG\xa02.00',
  '"van der Heijden, Marcel G A"',
  'MNF',
  '2018',
  '"Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14."',
  '"Cropping practices manipulate abundance patterns of root and soil microbiome members paving the way to smart farming. Microbiome, 6(1):14."',
  '"Background Harnessing beneficial microbes presents a promising strategy to optimize plant growth and agricultural sustainability. Little is known to which extent and how specifically soil and plant microbiomes can be manipulated through different cropping practices. Here, we investigated soil and wheat root microbial communities in a cropping system experiment consisting of conventional and organic managements, both with different tillage intensities.'],
 [15,
  'Results While microbial richness was marginally affected, we found pronounced cropping effects on community composition, which were specific for the respective

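AH. THE PROBLEM IS NOT THAT THERE IS A NEWLINE INSIDE THE STRING, IT'S THAT THE PIECES ARRIVE AS SEPARATE STRINGS (looked at the .tsv). `for line in f` splits on every newline, including the ones inside an abstract, so by the time I call `.replace()` there is nothing left to replace.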
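A tiny reproduction of this, using an in-memory string in place of the real file. The three-column layout and the text in `raw` are invented for the demo:

```python
import io

# Hypothetical mini .tsv: a header plus one record whose quoted abstract contains a newline
raw = (
    'id\ttitle\tabstract\n'
    '1\t"Some title"\t"Background first part.\nResults second part."\n'
)

# Same approach as the cells above: iterate line by line, strip, split on tabs
fragments = [line.rstrip("\n").split("\t") for line in io.StringIO(raw)]
for fragment in fragments:
    print(len(fragment), fragment)

# After rstrip, no line ever contains "\n", so the if-condition can never fire
newline_seen = any("\n" in line.rstrip("\n") for line in io.StringIO(raw))
print(newline_seen)
```

The record comes out as one 3-column row ending mid-abstract, followed by a 1-column row holding the rest, which is the same pattern as the buggy rows.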

Now what am I gonna do. The erroneous pieces are already being read in as `line`s, so after that point maybe I have to use char literals or Regex to glue them back together before any of the rest of the code.
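One option that avoids manual gluing is Python's built-in `csv` module, which keeps a quoted field together even when it spans several lines. This is a sketch, not a tested fix for the real file: it assumes Numbers wrapped the multi-line abstracts in double quotes the way CSV conventions expect (the trailing `"` on the fragments suggests so), and the data in `raw` is invented.

```python
import csv
import io

# Hypothetical mini .tsv: a header plus one record whose quoted abstract contains a newline
raw = (
    'id\ttitle\tabstract\n'
    '1\t"Some title"\t"Background first part.\nResults second part."\n'
)

reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)  # first record is the column names

rows = []
for row in reader:
    # The quoted abstract arrives as ONE field; flatten its inner newlines to spaces
    row[-1] = row[-1].replace("\n", " ")
    rows.append(row)

print(len(rows))
print(rows[0])
```

With the real file this would be `open("data/robin_data.tsv", newline="")` passed to `csv.reader`, as the `csv` documentation recommends. A side effect is that the wrapping quote marks are stripped from every field, so titles could then be searched without them.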

I should probably also really move on from this whole endeavour, which began just because I wanted a fucking average token count for abstracts in Robin's subset. Maybe I could have done this in the Terminal.
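For the record, the number I actually wanted is only a few lines once the abstracts are whole. A sketch that treats a token as a whitespace-separated chunk, which is my simplification here and not necessarily the tokenization the thesis will use; the two abstracts are placeholders:

```python
# Hypothetical stand-in for the cleaned abstract column
abstracts = [
    "Species in extreme habitats face changes.",
    "One of the primary challenges of our time.",
]

# Whitespace tokenization: punctuation stays attached to the neighbouring word
token_counts = [len(abstract.split()) for abstract in abstracts]
average_tokens = sum(token_counts) / len(token_counts)
print(token_counts, average_tokens)
```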