In [9]:
import pandas as pd
import re
import numpy as np

In [2]:
df = pd.read_csv("extracted_quotes.csv", index_col = None)

In [3]:
df.head(5)

Unnamed: 0,quote,source,author,heading_context
0,Un homme heureux est trop content du présent p...,A happy man is too satisfied with the present ...,Albert Einstein,1890s
1,Autoritätsdusel ist der größte Feind der Wahrh...,Blind obedience to authority is the greatest e...,Albert Einstein,1900s
2,Lieber Habicht! / Es herrscht ein weihevolles ...,"Dear Habicht, / Such a solemn air of silence h...",Albert Einstein,1900s
3,E=mc²,The equation originally expressed the equivale...,Albert Einstein,1900s
4,The mass of a body is a measure of its energy ...,Ist die Trägheit eines Körpers von seinem Ener...,Albert Einstein,1900s


Let's perform some exploratory data analysis on the dataset.

In [4]:
df.shape

(30000, 4)

In [15]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   quote            29984 non-null  object
 1   source           22914 non-null  object
 2   author           30000 non-null  object
 3   heading_context  20671 non-null  object
dtypes: object(4)
memory usage: 937.6+ KB
None


In [17]:
# Missing values
print(df.isna().sum())

quote                16
source             7086
author                0
heading_context    9329
dtype: int64


Some quotes are null. Let's look at them.

In [18]:
null_quotes = df[df['quote'].isna() | df['quote'].str.strip().eq('')]

print(f"Number of null/empty quotes: {len(null_quotes)}")
display(null_quotes[['quote', 'author', 'source', 'heading_context']].head(20))

Number of null/empty quotes: 16


Unnamed: 0,quote,author,source,heading_context
13703,,Chinese proverbs,Transliteration (pinyin): Chángjiāng hòulàng t...,Ch
13704,,Chinese proverbs,Transliteration (pinyin): Dú wàn juǎn shū bùrú...,D
13708,,Chinese proverbs,Transliteration (pinyin): Fù zhài zǐ huán. Tra...,F
13709,,Chinese proverbs,Transliteration (pinyin): Hài rén zhī xīn bù k...,H
13712,,Chinese proverbs,"Transliteration (pinyin): Kōngxuéláifēng, wèib...",K
13713,,Chinese proverbs,Transliteration (pinyin): Liángyào kǔkǒu Tradi...,L
13714,,Chinese proverbs,Transliteration: Yǒu qí fù bì yǒu qí zǐ. Havin...,L
13715,,Chinese proverbs,Transliteration (pinyin): Rén suàn bùrú tiān s...,R
13717,,Chinese proverbs,Transliteration (pinyin): Sān gè héshàng méi s...,S
13718,,Chinese proverbs,Transliteration (pinyin): Sǐ mǎ dāng huó mǎ yī...,S


The problem is that these Chinese proverbs are written in Chinese, and while extracting the quotes we only kept English-language characters, which is why they are null.

Let's see whether all the Chinese proverbs are like this.

In [27]:
chinese_quotes = df[df['author'].str.strip().eq('Chinese proverbs')]

print(f"Number of chinese quotes: {len(chinese_quotes)}")
display(chinese_quotes[['quote', 'author', 'source', 'heading_context']].head(20))

Number of chinese quotes: 61


Unnamed: 0,quote,author,source,heading_context
13699,， ， ， ； 。,Chinese proverbs,Transliteration (pinyin): Bù wén bù ruò wén zh...,B
13700,小洞不补，大洞吃苦,Chinese proverbs,"Transliteration: xiǎo dòng bù bǔ, dà dòng chī ...",B
13701,读书须用意，一字值千金,Chinese proverbs,"Transliteration: dú shū xū yòng yì, yī zì zhí ...",B
13702,宝剑锋从磨砺出，梅花香自苦寒来,Chinese proverbs,"Transliteration: Bǎojiàn fēng cóng mólì chū, m...",B
13703,,Chinese proverbs,Transliteration (pinyin): Chángjiāng hòulàng t...,Ch
13704,,Chinese proverbs,Transliteration (pinyin): Dú wàn juǎn shū bùrú...,D
13705,， ， https: //learnchinesewithabdul. com/chines...,Chinese proverbs,"pinyin: fēng xiàng biàn shí, yǒu rén jìng qián...",F
13706,https: //archive. org/details/diversproverbsw0...,Chinese proverbs,Transliteration (pinyin): Fáng rén zhī xīn bùk...,F
13707,",",Chinese proverbs,"Transliteration (pinyin): Fú wú chóng zhì, huò...",F
13708,,Chinese proverbs,Transliteration (pinyin): Fù zhài zǐ huán. Tra...,F


In [7]:
pd.set_option('display.max_colwidth', None)
print(df.loc[13716, ['quote', 'source']])

quote                                                                                                                                                                                                                                      一
source    Transliteration (pinyin): Ròu bāozi dǎ gǒu 一 qù bù huítóu. Traditional: 肉包子打狗一去不回頭 Simplified: 肉包子打狗一去不回头 To hit a dog with a meat-bun, so it leaves never turning around. Meaning: Punishment gives less incentive than a reward.
Name: 13716, dtype: object
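
Rows such as 13699, 13706, 13707 and 13716 are not null, yet the quote holds only punctuation, a URL, or a single character, so the null check above does not catch them. Here is a minimal sketch of a heuristic flag for such rows. The function name `has_no_usable_text` and the two-letter threshold are assumptions made for illustration, not part of the original extraction code.

```python
import re
import pandas as pd

URL_PATTERN = re.compile(r"https?:\s*//")  # also matches the "https: //" form seen above

def has_no_usable_text(quote) -> bool:
    """True when a quote is missing, contains URL residue, or has fewer than two letters."""
    if pd.isna(quote):
        return True
    text = str(quote).strip()
    if URL_PATTERN.search(text):
        return True
    # str.isalpha is Unicode-aware, so Chinese characters count as letters
    # while full-width punctuation does not.
    return sum(ch.isalpha() for ch in text) < 2

# Toy values copied from the outputs above.
sample = pd.Series(["， ， ， ； 。", "小洞不补，大洞吃苦", None, ",", "一", "E=mc²"])
flags = sample.map(has_no_usable_text)
print(flags.tolist())
```

On the real data this could be applied as `df[df['quote'].map(has_no_usable_text)]`. It is only a heuristic, so the flagged rows should be reviewed before anything is dropped.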


In [41]:
# pd.set_option('display.max_colwidth', None)

df.loc[df['author'] == "Chinese proverbs", 'source']

13699                                                                                                                                                        Transliteration (pinyin): Bù wén bù ruò wén zhī, wén zhī bù ruò jiàn zhī, jiàn zhī bù ruò zhīzhī, zhīzhī bù ruò xíng zhī; xué zhìyú xíng zhī ér zhǐ yǐ. Traditional: 不聞不若聞之，聞之不若見之，見之不若知之，知之不若行之；學至於行之而止矣 Simplified: 不闻不若闻之，闻之不若见之，见之不若知之，知之不若行之；学至于行之而止矣 From Xun Zi (荀子 8. 儒效 23）.
13700    Transliteration: xiǎo dòng bù bǔ, dà dòng chī kǔ A small hole not mended in time will become a big hole much more difficult to mend. English equivalent: "What's past is prologue. " or "A stitch in time saves nine. " Meaning: Fix something while it can be fixed. Don't wait until it's too late to do so. "Destroy the seed of evil, or it will grow up to your ruin. " Aesop, "The Swallow and the Other Birds" (c. 6th century BC)
13701                                                                                     Transliteration: dú shū xū yòng yì, yī z

When the original text is in Chinese characters with a link attached, the extractor skipped it as an HTML link and could not extract the quote. When the text is in English, it was extracted successfully.

Okay, now I understand. Multiple quotes on the Chinese proverbs page are written as links; those proverbs were not extracted, and the translation was mistakenly extracted into the source column.

So we have two options:
* Drop the rows for which the quote is NaN.
* Swap the source and the quote values where the quote is NaN, because the source contains the translation and is, in a way, also a quote.

We will decide in the preprocessing part.
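
The two options above can be sketched on a toy frame as follows. This is only an illustration under assumptions: the frame `toy` and its values are made up, and the real decision is still left to preprocessing.

```python
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "quote": ["Actions speak louder than words.", None, "   "],
    "source": [None, "Transliteration: ... Meaning: ...", "Some translation"],
    "author": ["English proverbs", "Chinese proverbs", "Chinese proverbs"],
})

# Rows whose quote is NaN or whitespace-only (same test as the null_quotes cell).
is_empty = toy["quote"].isna() | toy["quote"].str.strip().eq("")

# Option 1: drop those rows.
dropped = toy[~is_empty].reset_index(drop=True)

# Option 2: move the source text into the quote column and clear the source.
swapped = toy.copy()
swapped.loc[is_empty, "quote"] = swapped.loc[is_empty, "source"]
swapped.loc[is_empty, "source"] = None

print(len(dropped), swapped["quote"].isna().sum())
```

Option 2 keeps every row but leaves the moved rows without a source, so it only works if a missing source is acceptable downstream.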

Let's look for quotes that are missing both author and source.

In [14]:
missing_author_source = df[
    (df['author'].isna() | df['author'].str.strip().eq('')) &
    (df['source'].isna() | df['source'].str.strip().eq(''))
]

print(f"Quotes without both author and source: {len(missing_author_source)}")

Quotes without both author and source: 0


No quote is missing both source and author, which is a good thing.

Let's see how many unique authors we have.

In [29]:
unique_authors = df['author'].nunique(dropna=True)

# List of unique authors (optional)
authors_list = df['author'].dropna().unique()

print(f"Number of unique authors: {unique_authors}")
print(authors_list[:20])  # show first 20 authors

Number of unique authors: 368
['Albert Einstein' 'Disputed with Albert Einstein'
 'Misattributed to Albert Einstein' 'unknown' 'Zen proverbs'
 'Wikiquote: Templates' 'Latin proverbs' 'Nero' 'Martin Luther King Jr.'
 'Disputed with Martin Luther King Jr.'
 'Misattributed to Martin Luther King Jr.' 'John Cage' 'Harry S. Truman'
 'Disputed with Harry S. Truman' 'Misattributed to Harry S. Truman'
 'Bertrand Russell' 'Disputed with Bertrand Russell'
 'Misattributed to Bertrand Russell' 'Oscar Wilde'
 'Disputed with Oscar Wilde']


We have 368 unique authors, but the extraction from the dump introduced one mistake: "Albert Einstein", "Misattributed to Albert Einstein", and "Disputed with Albert Einstein" appear as three different authors in the author list. This is entity fragmentation.
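
One way to undo this fragmentation later is to split each label into a base author and an attribution status. The sketch below assumes the only prefixes are the two seen in the output above; the function name, the column names, and the status label "attributed" are hypothetical.

```python
import pandas as pd

# Prefixes observed in the author list above, mapped to an assumed status label.
PREFIXES = {
    "Misattributed to ": "misattributed",
    "Disputed with ": "disputed",
}

def split_attribution(author: str) -> tuple[str, str]:
    """Return (base_author, status) for one raw author label."""
    for prefix, status in PREFIXES.items():
        if author.startswith(prefix):
            return author[len(prefix):].strip(), status
    return author.strip(), "attributed"

raw = pd.Series([
    "Albert Einstein",
    "Disputed with Albert Einstein",
    "Misattributed to Albert Einstein",
    "Nero",
])
parts = raw.map(split_attribution)
authors = pd.DataFrame(parts.tolist(), columns=["base_author", "attribution_status"])
print(authors)
```

Keeping the status in its own column preserves the disputed and misattributed information while letting all three labels resolve to a single author entity.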

Let's see who are the authors with most quotes and the authors with the least quotes.

In [30]:
author_counts = df['author'].value_counts().reset_index()
author_counts.columns = ['author', 'quote_count']

# Sort descending (most quotes first)
most_quotes = author_counts.sort_values(by='quote_count', ascending=False)

# Sort ascending (least quotes first)
least_quotes = author_counts.sort_values(by='quote_count', ascending=True)

print("Authors with most quotes:")
print(most_quotes.head(10))  # top 10

print("\nAuthors with least quotes:")
print(least_quotes.head(10))  # bottom 10

Authors with most quotes:
                   author  quote_count
0                 unknown         5342
1              Last words         1051
2        English proverbs          897
3          George W. Bush          674
4       Winston Churchill          504
5     Ralph Waldo Emerson          486
6        Bertrand Russell          480
7  Martin Luther King Jr.          460
8         John F. Kennedy          454
9          Thomas Carlyle          412

Authors with least quotes:
                                       author  quote_count
367          Disputed with Edsger W. Dijkstra            1
337       Misattributed to Edsger W. Dijkstra            1
336             Misattributed to Helen Keller            1
335         Misattributed to William Congreve            1
334             Misattributed to Walt Whitman            1
333             Disputed with Harry S. Truman            1
332            Misattributed to Stephen Crane            1
331               Disputed with H. L. Mencken

We can deal with the misattributed and disputed entries later, but we should take a look at Last words and English proverbs.

Let's check whether the sources for English proverbs and Last words are meaningful, because the author value does not carry any meaningful information for them.

In [31]:
# For English proverbs
print("\nEnglish proverbs:")
print(df[df['author'] == "English proverbs"].head(20))



English proverbs:
                                                   quote  \
23386               Absence makes the heart grow fonder.   
23387                       Long absent, soon forgotten.   
23388         The absent are always in the wrong. (1640)   
23389  Accidents will happen in the best families. (1...   
23390                   Actions speak louder than words.   
23391  Admiration: our polite recognition of another ...   
23392            He who does not advance goes backwards.   
23393                Advice most needed is least heeded.   
23394          Advise none to marry or go to war. (1640)   
23395                             Advisers run no risks.   
23396        All is fair in love and war. (17th century)   
23397         All is well that ends well. (14th century)   
23398    America is God's melting-pot. (Israel Zangwill)   
23399  Good riding at two anchors, men have told, for...   
23400  Anger makes dull men witty, but it keeps them ...   
23401  Do not let sun

I can see that the quotes are scraped cleanly, but the sources are missing for some of them and I don't know why. I used the same code to scrape everything, yet the source is present for some quotes and missing for others.

Should I revisit the extraction code? For now I think not, because we don't have any quote for which both author and source are missing, so together they will provide some information about the quote. For now I am not going to change the extraction code.

In [32]:
# For Last words
print("\nLast Words:")
print(df[df['author'] == "Last words"].head(20))



Last Words:
                                                   quote  \
14083                                        No comment.   
14084                                I did what I could.   
14085                                         Van Halen!   
14086  Come Lord Jesus, come quickly, finish in me th...   
14087  يام السرور التي صفت لي دون تكدير في مدة سلطاني...   
14088  من در حال رفتن هستم و شما مي خواهيد غذا بخورم؟...   
14089                         I don't know. [Attributed]   
14090  שלף חרבך ומותתני פן-יאמרו לי, אשה הרגתהו (Shel...   
14091  May the Most High God preserve thee from destr...   
14092                It's okay! Gun's not loaded. . see?   
14093  Ja, maar niet te veel. |first= Bert |year= 199...   
14094  Oh, yes; it is the glorious Fourth of July. It...   
14095           This is the last of Earth. I am content.   
14096  Principally, and first of all, I recommend my ...   
14097             See in what peace a Christian can die.   
14098                      

In [34]:
pd.set_option('display.max_colwidth', None)
print(df.loc[14088, ['quote', 'source']])

quote                                                                                                       من در حال رفتن هستم و شما مي خواهيد غذا بخورم؟ (Man dar hâl raftan hastam, va shomâ mey khâhid ghazâ békhoram? )
source    Translation: You wish Me to take some food, and I am going? Who: `Abdu'l-Bahá, son of Bahá'u'lláh and one of three central figures of the Bahá'i Faith. Note: Spoken when food was offered to him on his deathbed.
Name: 14088, dtype: object


The sources for Chinese proverbs, Last words, and English proverbs confirm that if the quote is in another language or has a translation, the source contains everything.

It is up to us whether to trim the source or keep it as it is in the database. That will depend on the impact it has on the size of the graph database.
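
To get a rough feel for that impact, the characters saved by trimming can be measured before deciding. This is a minimal sketch, assuming a hypothetical cap `MAX_SOURCE_CHARS` and a hypothetical helper `trim_source`; neither is part of the existing pipeline.

```python
import pandas as pd

MAX_SOURCE_CHARS = 300  # hypothetical cap, to be tuned against the real data

def trim_source(source, limit=MAX_SOURCE_CHARS):
    """Cut a source string to at most `limit` characters, marking the cut with '...'."""
    if pd.isna(source):
        return source  # leave missing sources untouched
    text = str(source)
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."

# Toy sources: one short, one long, one missing.
sources = pd.Series(["Letter to a friend", "Translation: " + "x" * 1000, None])
trimmed = sources.map(trim_source)

# Characters saved by trimming (missing values are ignored by .str.len().sum()).
saved = sources.str.len().sum() - trimmed.str.len().sum()
print(f"Characters saved: {int(saved)}")
```

Running the same comparison on `df['source']` would show whether trimming changes the stored size enough to matter.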

Let's analyse the heading context and see how many unique values it has.

In [37]:
unique_headings = df['heading_context'].nunique(dropna=True)

# List of unique headings (optional)
heading_context_list = df['heading_context'].dropna().unique()

print(f"Number of unique headings: {unique_headings}")
print(heading_context_list[:20])  # show first 20 headings

Number of unique headings: 1724
['1890s' '1900s' '1910s' 'Principles of Research (1918)' '1920s'
 'Sidelights on Relativity (1922)' 'Viereck interview (1929)' '1930s'
 'Wisehart interview (1930)' 'Religion and Science (1930)'
 'What I Believe (1930)' 'Mein Weltbild (My World-view) (1931)'
 'My Credo (1932)' '1933' '1934' 'Obituary for Emmy Noether (1935)'
 'Why Do They Hate the Jews (1938)' '1940s' 'Science and Religion (1941)'
 'Only Then Shall We Find Courage (1946)']


This seems pretty clean.

Now let's look at the lengths of the quotes and the sources.

In [42]:
df['quote_length'] = df['quote'].apply(lambda x: len(str(x)))
df['source_length'] = df['source'].apply(lambda x: len(str(x)))

# Preview
df[['quote_length', 'source_length']].describe()


Unnamed: 0,quote_length,source_length
count,30000.0,30000.0
mean,328.241067,90.450633
std,392.587922,167.595513
min,1.0,1.0
25%,83.0,5.0
50%,187.0,40.0
75%,433.0,112.25
max,14814.0,4356.0
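
One caveat about the length cell above: `len(str(x))` converts a missing value to the string "nan", so each null quote or source is counted as 3 characters and is included in the `count` of 30000. The sketch below contrasts that with `.str.len()`, which keeps missing values as NaN. The toy series is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.Series(["E=mc²", np.nan, "No comment."])

# Same approach as the cell above: NaN becomes the string "nan", length 3.
as_in_cell = toy.apply(lambda x: len(str(x)))

# .str.len() leaves missing values as NaN, so describe() skips them.
nan_aware = toy.str.len()

print(as_in_cell.tolist())
print(nan_aware.describe())
```

With the NaN-aware version, the summary statistics for `source` would be computed over non-null sources only, which likely shifts the lower quartiles noticeably given how many sources are missing.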


* The longest quote is 14814 characters long and the shortest quote is 1 character long.
* The longest source is 4356 characters long and the shortest source is 1 character long.

Let's see them.

In [82]:
# Longest quotes
df.nlargest(5, 'quote_length')[['quote','author','source' ,'heading_context','quote_length']]

Unnamed: 0,quote,author,source,heading_context,quote_length
9023,"Judge: Do you want Mr. Bryan sworn? :Darrow: No. :Bryan: I can make affirmation; I can say ""So help me God, I will tell the truth. "": Darrow: No, I take it you will tell the truth, Mr. Bryan. You have given considerable study to the Bible, haven't you, Mr. Bryan? :Bryan: Yes, sir, I have tried to. :Darrow: Then you have made a general study of it? :Bryan: Yes, I have; I have studied the Bible for about 50 years, or sometime more than that, but, of course, I have studied it more as I have become older than when I was but a boy. :Darrow: You claim that everything in the Bible should be literally interpreted? :Bryan: I believe everything in the Bible should be accepted as it is given there: some of the Bible is given illustratively. For instance: ""Ye are the salt of the earth. "" I would not insist that man was actually salt, or that he had flesh of salt, but it is used in the sense of salt as saving God's people. :Darrow: But when you read that Jonah swallowed the whale--or that the whale swallowed Jonah--excuse me please--how do you literally interpret that? :Bryan: When I read that a ""big fish"" swallowed Jonah--it does not say whale. That is my recollection of it. A big fish, and I believe it, and I believe in a God who can make a whale and can make a man and make both what He pleases. :Darrow: Now, you say, the big fish swallowed Jonah, and he there remained how long--three days--and then he spewed him upon the land. You believe that the big fish was made to swallow Jonah? :Bryan: I am not prepared to say that; the Bible merely says it was done. :Darrow: You don't know whether it was the ordinary run of fish, or made for that purpose? :Bryan: You may guess; you evolutionists guess. . :Darrow: You are not prepared to say whether that fish was made especially to swallow a man or not? :Bryan: The Bible doesn't say, so I am not prepared to say. :Darrow: But do you believe He made them--that He made such a fish and that it was big enough to swallow Jonah? 
:Bryan: Yes, sir. Let me add: One miracle is just as easy to believe as another. :Darrow: Just as hard? :Bryan: It is hard to believe for you, but easy for me. A miracle is a thing performed beyond what man can perform. When you get within the realm of miracles; and it is just as easy to believe the miracle of Jonah as any other miracle in the Bible. :Darrow: Perfectly easy to believe that Jonah swallowed the whale? :Bryan: If the Bible said so; the Bible doesn't make as extreme statements as evolutionists do. :Darrow: The Bible says Joshua commanded the sun to stand still for the purpose of lengthening the day, doesn't it, and you believe it. :Bryan: I do. :Darrow: Do you believe at that time the entire sun went around the earth? :Bryan: No, I believe that the earth goes around the sun. :Darrow: Do you believe that the men who wrote it thought that the day could be lengthened or that the sun could be stopped? :Bryan: I don't know what they thought. :Darrow: You don't know? :Bryan: I think they wrote the fact without expressing their own thoughts. :Darrow: Have you an opinion as to whether or not the men who wrote that thought--: Thomas Stewart: (a prosecution lawyer)--I want to object, your honor. It has gone beyond the pale of any issue that could possibly be injected into this lawsuit, except by imagination. I do not think the defendant has a right to conduct the examination any further and I ask your honor to exclude it. :Bryan: It seems to me it would be too exacting to confine the defense to the facts. If they are not allowed to get away from the facts, what have they to deal with? :Judge: Mr. Bryan is willing to be examined. Go ahead. :Darrow: Can you answer my question directly? If the day was lengthened by stopping either the earth or the sun, it must have been the earth? :Bryan: Well, I should say so. :Darrow: Now, Mr. Bryan, have you ever pondered what would have happened to the earth if it had stood still? :Bryan: No. :Darrow: You have not? 
:Bryan: No; the God I believe in could have taken care of that, Mr. Darrow. :Darrow: I see. Have you ever pondered what would naturally happen to the earth if it stood still suddenly? :Bryan: No. :Darrow: Don't you know it would have been converted into molten mass of matter? :Bryan: You testify to that when you get on the stand, I will give you a chance. :Darrow: Don't you believe it? :Bryan: I would want to hear expert testimony on that. :Darrow: You have never investigated that subject? :Bryan: I don't think I have ever had the question asked. :Darrow: Or ever thought of it? :Bryan: I have been too busy on things that I thought were of more importance. :Darrow: You believe the story of the flood to be a literal interpretation? :Bryan: Yes, sir. :Darrow: When was that flood? :Bryan: I would not attempt to fix the date. The date is fixed, as suggested this morning. :Darrow: About 4004 B. C. ?: Bryan: That has been the estimate of a man that is accepted today. [A witness had testified on Bishop Ussher's theory that the Earth was formed in 4004 B. C. ] I would not say it is accurate. :Darrow: That estimate is printed in the Bible? :Bryan: Everybody knows, at least, I think most of the people know, that was the estimate given. :Darrow: But what do you think that the Bible itself says? Don't you know how it was arrived at? :Bryan: I never made a calculation. :Darrow: A calculation from what? :Bryan: I could not say. :Darrow: From the generations of man? :Bryan: I would not want to say that. :Darrow: What do you think? :Bryan: I do not think about things I don't think about. :Darrow: Do you think about things you do think about? :Bryan: Well, sometimes. (Laughter. ): Policeman Deputy Clason: Let us have order. . .: Thomas Stewart: {prosecution attorney}--Your honor, he is perfectly able to take care of this, but we are attaining no evidence. This is not competent evidence. :Bryan: These gentlemen have not had much chance--they did not come here to try this case. 
They came here to try revealed religion. I am here to defend it and they can ask me any question they please. :Judge: All right. (Applause. ): Darrow: Great applause from the bleachers. :Bryan: Darrow--I have never called them yokels. :Bryan: That is the ignorance of Tennessee, the bigotry. :Darrow: You mean who are applauding you? (Applause. ): Bryan: Those are the people whom you insult. :Darrow: You insult every man of science and learning in the world because he does believe in your fool religion. :Judge: I will not stand for that. :Darrow: For what he is doing? :Judge: I am talking to both of you. :Darrow: Do you know anything about how many people there were in Egypt 3, 500 years ago, or how many people there were in China 5, 000 years ago? :Bryan: No. :Darrow: Have you ever tried to find out? :Bryan: No, sir. You are the first man I ever heard of who has been interested in it. (Laughter. ): Darrow: Mr. Bryan, am I the first man you ever heard of who has been interested in the age of human societies and primitive man? :Bryan: You are the first man I ever heard speak of the number of people at those different periods. :Darrow: Where have you lived all your life? :Bryan: Not near you. (Laughter and applause. ): Darrow: Nor near anybody of learning? :Bryan: Oh, don't assume you know it all. :Darrow: Do you know there are thousands of books in our libraries on all those subjects I have been asking you about? :Bryan: I couldn't say, but I will take your word for it. . .: Darrow: Have you any idea how old the earth is? :Bryan: No. :Darrow: The book you have introduced in evidence tells you, doesn't it? :Bryan: I don't think it does, Mr. Darrow. :Darrow: Let's see whether it does; is this the one? :Bryan: That is the one, I think. :Darrow: It says B. C. 4004? :Bryan: That is Bishop Ussher's calculation. :Darrow: That is printed in the Bible you introduced? :Bryan: Yes, sir. :Darrow: Would you say that the earth was only 4, 000 years old? 
:Bryan: Oh, no; I think it is much older than that. :Darrow: How much? :Bryan: I couldn't say. :Darrow: Do you say whether the Bible itself says it is older than that? :Bryan: I don't think it is older or not. :Darrow: Do you think the earth was made in six days? :Bryan: Not six days of 24 hours. :Darrow: Doesn't it say so? :Bryan: No, sir. :Judge: Are you about through, Mr. Darrow? :Darrow: I want to ask a few more questions about the creation. :Judge: I know. We are going to adjourn when Mr. Bryan comes off the stand for the day. Be very brief, Mr. Darrow. Of course, I believe I will make myself clearer. Of course, it is incompetent testimony before the jury. The only reason I am allowing this to go in at all is that they may have it in the appellate court as showing what the affidavit would be. :Bryan: The reason I am answering is not for the benefit of the superior court. It is to keep these gentlemen from saying I was afraid to meet them and let them question me, and I want the Christian world to know that any atheist, agnostic, unbeliever, can question me anytime as to my belief in God, and I will answer him. :Darrow: I want to take an exception to this conduct of this witness. He may be very popular down here in the hills--: Bryan: Your honor, they have not asked a question legally and the only reason they have asked any question is for the purpose, as the question about Jonah was asked, for a chance to give this agnostic an opportunity to criticize a believer in the world of God; and I answered the question in order to shut his mouth so that he cannot go out and tell his atheistic friends that I would not answer his questions. That is the only reason, no more reason in the world. :Malone: (another defense counsel) Your honor on this very subject, I would like to say that I would have asked Mr. Bryan, and I consider myself as good a Christian as he is, every question that Mr. 
Darrow has asked him for the purpose of bringing out whether or not there is to be taken in this court a literal interpretation of the Bible, or whether, obviously, as these questions indicate, if a general and literal construction cannot be put upon the parts of the Bible which have been covered by Mr. Darrow's questions. I hope for the last time no further attempt will be made by counsel on the other side of the case, or Mr. Bryan, to say the defense is concerned at all with Mr. Darrow's particular religious views or lack of religious views. We are here as lawyers with the same right to our views. I have the same right to mine as a Christian as Mr. Bryan has to his, and we do not intend to have this case charged by Mr. Darrow's agnosticism or Mr. Bryan's brand of Christianity. (A great applause. ): Darrow: Mr. Bryan, do you believe that the first woman was Eve? :Bryan: Yes. :Darrow: Do you believe she was literally made out of Adam's rib? :Bryan: I do. :Darrow: Did you ever discover where Cain got his wife? :Bryan: No, sir. I leave the agnostics to hunt for her. :Darrow: You have never found out? :Bryan: I have never tried to find out. :Darrow: You have never tried to find out? :Bryan: No. :Darrow: The Bible says he got one, doesn't it? Were there other people on the earth at that time? :Bryan: I cannot say. :Darrow: You cannot say. Did that ever enter your consideration? :Bryan: Never bothered me. :Darrow: There were no others recorded, but Cain got a wife. :Bryan: That is what the Bible says. :Darrow: Where she came from you do not know. All right. Does the statement, ""The morning and the evening were the first day, "" and ""The morning and the evening were the second day, "" mean anything to you? :Bryan: I do not think it necessarily means a 24-hour day. :Darrow: You do not? :Bryan: No. :Darrow: What do you consider it to be? :Bryan: I have not attempted to explain it. If you will take the second chapter--let me have the book. [Reaches for a Bible. 
] The fourth verse of the second chapter says: ""These are the generations of the heavens and of the earth, when they were created in the day that the Lord God made the earth and the heavens, "" the word day there in the very next chapter is used to describe a period. I do not see that there is any necessity for construing the words, ""the evening and the morning, "" as meaning necessarily a 24-hour day, ""in the day when the Lord made the heaven and the earth. "": Darrow: Then, when the Bible said, for instance, ""and God called the firmament heaven. And the evening and the morning were the second day, "" that does not necessarily mean twenty-four hours? :Bryan: I do not think it necessarily does. :Darrow: Do you think it does or does not? :Bryan: I know a great many think so. :Darrow: What do you think? :Bryan: I do not think it does. :Darrow: You think those were not literal days? :Bryan: I do not think they were twenty-four-hour days. :Darrow: What do you think about it? :Bryan: That is my opinion--I do not know that my opinion is better on that subject than those who think it does. :Darrow: You do not think that? :Bryan: No. But I think it would be just as easy for the kind of God we believe in to make the earth in six days as in six years or in 6 million years or in 600 million years. I do not think it important whether we believe one or the other. :Darrow: Do you think those were literal days? :Bryan: My impression is they were periods, but I would not attempt to argue against anybody who wanted to believe in literal days. :Darrow: I will read it to you from the Bible: ""And the Lord God said unto the serpent, because thou hast done this, thou art cursed above all cattle, and above every beast of the field; upon thy belly shalt thou go and dust shalt thou eat all the days of thy life. "" Do you think that is why the serpent is compelled to crawl upon its belly? :Bryan: I believe that. :Darrow: Have you any idea how the snake went before that time? 
:Bryan: No, sir. :Darrow: Do you know whether he walked on his tail or not? :Bryan: No, sir. I have no way to know. (Laughter. ): Darrow: Now, you refer to the cloud that was put in heaven after the flood, the rainbow. Do you believe in that? :Bryan: Read it. :Darrow: All right, Mr. Bryan, I will read it for you. :Bryan: Your Honor, I think I can shorten this testimony. The only purpose Mr. Darrow has is to slur at the Bible, but I will answer his question. I will answer it all at once, and I have no objection in the world. I want the world to know that this man, who does not believe in a God, is trying to use a court in Tennessee to slur at it, and while it will require time, I am willing to take it. :Darrow: I object to your statement. I am examining you on your fool ideas that no intelligent Christian on earth believes. :Judge: Court is adjourned until 9 o'clock tomorrow morning. :* Clarence Darrow's examination of William Jennings Bryan at the 1925 Scopes trial Scopes Trial Day 7",Clarence Darrow,,Scopes Trial (1925),14814
26177,"All that is necessary for the triumph of evil is that good men do nothing. :: This purported quote bears a resemblance to the narrated theme of Sergei Bondarchuk's Soviet film adaptation of Leo Tolstoy's War and Peace, produced in 1966. In it the narrator declares ""All that is necessary for evil to triumph is for good men to do nothing"", although since the original is in Russian various translations to English are possible. This purported quote also bears resemblance to a quote misattributed to Plato (Respectfully Quoted: A Dictionary of Quotations) ""The penalty good men pay for indifference to public affairs is to be ruled by evil men. "" It also bears resemblance to what Albert Einstein wrote as part of his tribute to Pablo Casals: ""The world is in greater peril from those who tolerate or encourage evil than from those who actually commit it. "": : More research done on this matter is available at these two links: Burkequote & Burkequote2 — as the information at these links indicates, there are many variants of this statement, probably because there is no known original by Burke. In addition, an exhaustive examination of this quote has been done at the following link: QuoteInvestigator. {| class=""wikitable collapsible collapsed""! Selected examples of variations of ""All that is necessary. . 
"" |- |: All that is necessary for evil to triumph is for good men to do nothing: All that is necessary for the triumph of evil is for good men to do nothing: All that is necessary for evil to triumph is that good men do nothing: All that is necessary for evil to triumph is for a few good men to do nothing: All that is necessary for the triumph of evil is for a few good men to do nothing: All that is necessary for the triumph of evil is for some good men to do nothing: All that is necessary for evil to triumph is for all good men to do nothing: All that is necessary for evil to triumph is for enough good men to do nothing: All that is necessary for the triumph of evil is that enough good men do nothing: All that is essential for the triumph of evil is that good men do nothing: All that is needed for the triumph of evil is for good men to do nothing: All that is needed for the triumph of evil is that good men do nothing: All that is needed for evil to triumph is for good men to do nothing: All that is needed for evil to triumph is that good men do nothing: All that is needed for the triumph of evil is for enough good men to do nothing: All that is needed for the forces of evil to triumph is for enough good men to do nothing: All that is required for the triumph of evil is that good men do nothing: All that is required for the triumph of evil is for good men to do nothing: All that is required for evil to triumph is for good men to do nothing: All that is required for evil to triumph is that good men do nothing: The only thing necessary for the triumph of evil is for good men to do nothing: The only thing necessary for evil to triumph is for good men to do nothing: The only thing necessary for the triumph of evil is that good men do nothing: The only thing necessary for evil to triumph is for good men to do nothing: The only thing required for evil to triumph is for good men to do nothing: The only thing needed for evil to triumph is that good men do nothing: The 
only thing needed for the triumph of evil is for good men to do nothing: The only thing that is necessary for the triumph of evil is for good men to do nothing: The only thing that is necessary for the triumph of evil is that good men do nothing: All that it takes for evil to triumph is for good men to do nothing: All it takes for evil to triumph is for good men to do nothing: All that's necessary for the triumph of evil is for good men to do nothing: All that's needed for the forces of evil to triumph is for enough good men to do nothing: All that's needed for the triumph of evil is for good men to do nothing: All that is necessary for evil to succeed is for good men to do nothing: For evil to prosper all it needs is for good people to do nothing: All that is necessary for the forces of evil to win in the world is for enough good men to do nothing: All that's necessary for the forces of evil to win in the world is for enough good men to do nothing: All that is required for evil to triumph is for good [wo]men to do nothing: The only thing needed for evil to triumph is for enough good men [and women] to do nothing: The only thing required for evil to triumph is for good men (and women! 
) to do nothing: All that is necessary for evil to triumph is that good men (and women) do nothing: All that is necessary for the triumph of evil is that good men (and women) do nothing: For evil to triumph it is necessary only that good men [and women] do nothing: All that is necessary for evil to triumph is for good men and women to do nothing: All that it takes for the triumph of evil is that good men and women do nothing: The only thing necessary for the triumph of evil is for good men and women to do nothing: All it takes for evil to triumph is for good people to do nothing: All that is necessary for evil to triumph is for good people to do nothing: All that is necessary for the triumph of evil is that good people do nothing: All that needs to be done for evil to prevail is that good men do nothing: The only thing that has to happen in this world for evil to triumph is for good men to do nothing: All that is necessary for evil to triumph in the world is for enough good men and women to do nothing: Evil thrives when good men do nothing: For evil to triumph good men need do nothing: For evil to triumph good men have to do nothing: The best way for evil to triumph is for good men to do nothing: The surest way to assure the triumph of evil is for good men to do nothing: Evil will triumph so long as good men do nothing: It is necessary only for good men to say nothing for evil to triumph: It is necessary only for the good man to do nothing for evil to triumph: For evil to triumph it is necessary for good men to do nothing: For evil to triumph it is sufficient for good men to do nothing: All it takes for evil to triumph is for good men to stand by and do nothing: All that is necessary for the forces of evil to triumph is for good men to do nothing: All that is required for evil to triumph over good is for good men to do nothing: Evil can triumph only if good men do nothing: The only thing evil men need to triumph is for good men to do nothing: The only thing for the 
triumph of evil is for good men to do nothing: The only thing necessary for the triumph of evil is for enough good men to do nothing: The only thing necessary for the triumph of evil is that good men stand by and do nothing: The only thing necessary for the triumph of evil was for good men to do nothing: The only way for evil to triumph is for good men to do nothing: Evil prevails when good men do nothing |}",Misattributed to Edmund Burke,"This is probably the most quoted statement attributed to Burke, and an extraordinary number of variants of it exist, but all without any definite original source. They closely resemble remarks known to have been made by the Utilitarian philosopher John Stuart Mill, in an address at the University of St. Andrews (1 February 1867): Bad men need nothing more to compass their ends, than that good men should look on and do nothing. The very extensively used remarks attributed to Burke might be based on a paraphrase of some of his ideas, but he is not known to have ever declared them in so succinct a manner in any of his writings. It has been suggested that they may have been adapted from these lines of Burke's in his Thoughts on the Cause of the Present Discontents (1770): ""When bad men combine, the good must associate; else they will fall one by one, an unpitied sacrifice in a contemptible struggle. "" (see above)",,6921
15543,"We stand for a living wage. Wages are subnormal if they fail to provide a living for those who devote their time and energy to industrial occupations. The monetary equivalent of a living wage varies according to local conditions, but must include enough to secure the elements of a normal standard of living--a standard high enough to make morality possible, to provide for education and recreation, to care for immature members of the family, to maintain the family during periods of sickness, and to permit of reasonable saving for old age. Hours are excessive if they fail to afford the worker sufficient time to recuperate and return to his work thoroughly refreshed. We hold that the night labor of women and children is abnormal and should be prohibited; we hold that the employment of women over forty-eight hours per week is abnormal and should be prohibited. We hold that the seven day working week is abnormal, and we hold that one day of rest in seven should be provided in law. We hold that the continuous industries, operating twenty-four hours out of twenty-four, are abnormal, and where, because of public necessity or of technical reasons (such as molten metal), the twenty-four hours must be divided into two shifts of twelve hours or three shifts of eight, they should by law be divided into three of eight. Safety conditions are abnormal when, through unguarded machinery, poisons, electrical voltage, or otherwise, the workers are subjected to unnecessary hazards of life and limb; and all such occupations should come under governmental regulation and control. Home life is abnormal when tenement manufacture is carried on in the household. It is a serious menace to health, education, and childhood, and should therefore be entirely prohibited. Temporary construction camps are abnormal homes and should be subjected to governmental sanitary regulation. 
The premature employment of children is abnormal and should be prohibited; so also the employment of women in manufacturing, commerce, or other trades where work compels standing constantly; and also any employment of women in such trades for a period of at least eight weeks at time of childbirth. Our aim should be to secure conditions which will tend everywhere towards regular industry, and will do away with the necessity for rush periods, followed by out-of-work seasons, which put so severe a strain on wage-workers. It is abnormal for any industry to throw back upon the community the human wreckage due to its wear and tear, and the hazzards of sickness, accident, invalidism, involuntary unemployment, and old age should be provided for through insurance. This should be made a charge in whole or in part upon the industries the employer, the employee, and perhaps the people at large, to contribute severally in some degree. Wherever such standards are not met by given establishments, by given industries, are unprovided for by a legislature, or are balked by unenlightened courts, the workers are in jeopardy, the progressive employer is penalized, and the community pays a heavy cost in lessened efficiency and in misery. What Germany has done in the way of old age pensions or insurance should be studied by us, and the system adapted to our uses, with whatever modifications are rendered necessary by our different ways of life and habits of thought. Workingwomen have the same need to combine for protection that workingmen have; the ballot is as necessary for one class as for the other; we do not believe that with the two sexes there is identity of function; but we do believe that there should be equality of right; and therefore we favor woman suffrage. In those conservative States where there is genuine doubt how the women stand on this matter I suggest that it be referred to a vote of the women, so that they may themselves make the decision. 
Surely if women could vote, they would strengthen the hands of those who are endeavoring to deal in efficient fashion with evils such as the white slave traffic; evils which can in part be dealt with Nationally, but which in large part can be reached only by determined local action, such as insisting on the widespread publication of the names of the owners, the landlords, of houses used for immoral purposes. No people are more vitally interested than workingmen and workingwomen in questions affecting the public health. The pure food law must be strengthened and efficiently enforced. In the National Government one department should be intrusted with all the agencies relating to the public health, from the enforcement of the pure food law to the administration of quarantine. This department, through its special health service, would co-operate intelligently with the various State and municipal bodies established for the same end. There would be no discrimination against or for any one set of therapeutic methods, against or for any one school of medicine or system of healing; the aim would be merely to secure under one administrative body efficient sanitary regulation in the interest of the people as a whole.",Theodore Roosevelt,,1910s,5071
424,"Being a lover of freedom, when the revolution came in Germany, I looked to the universities to defend it, knowing that they had always boasted of their devotion to the cause of truth; but, no, the universities immediately were silenced. Then I looked to the great editors of the newspapers whose flaming editorials in days gone by had proclaimed their love of freedom; but they, like the universities, were silenced in a few short weeks. Then I looked to individual writers who, as literary guides of Germany, had written much and often concerning the place of freedom in modern life; but they, too, were mute. Only the church stood squarely across the path of Hitler's campaign for suppressing truth. I never had any special interest in the church before, but now I feel a great affection and admiration because the church alone has had the courage and persistence to stand for intellectual truth and moral freedom. I am forced thus to confess that what I once despised I now praise unreservedly. :: In his original statement Einstein was probably referring to the actions of the Emergency Covenant of Pastors organized by Martin Niemöller, and the Confessing Church which he and other prominent churchmen such as Karl Barth and Dietrich Bonhoeffer established in opposition to Nazi policies. :: Einstein also made some scathingly negative comments about the behavior of the Church under the Nazi regime (and its behavior towards Jews throughout history) in a 1943 conversation with William Hermanns recorded in Hermanns' book Einstein and the Poet (1983). On p. 63 Hermanns records him saying ""Never in history has violence been so widespread as in Nazi Germany. The concentration camps make the actions of Genghis Khan look like child's play. But what makes me shudder is that the Church is silent. One doesn't need to be a prophet to say, 'The Catholic Church will pay for this silence. ' Dr. Hermanns, you will live to see that there is moral law in the universe. . 
.There are cosmic laws, Dr. Hermanns. They cannot be bribed by prayers or incense. What an insult to the principles of creation. But remember, that for God a thousand years is a day. This power maneuver of the Church, these Concordats through the centuries with worldly powers. . the Church has to pay for it. We live now in a scientific age and in a psychological age. You are a sociologist, aren't you? You know what the Herdenmenschen (men of herd mentality) can do when they are organized and have a leader, especially if he is a spokesmen for the Church. I do not say that the unspeakable crimes of the Church for 2000 years had always the blessings of the Vatican, but it vaccinated its believers with the idea: We have the true God, and the Jews have crucified Him. The Church sowed hate instead of love, though the Ten Commandments state: Thou shalt not kill. "" And then on p. 64: ""I'm not a Communist but I can well understand why they destroyed the Church in Russia. All the wrongs come home, as the proverb says. The Church will pay for its dealings with Hitler, and Germany, too. "" And on p. 65: ""I don't like to implant in youth the Church's doctrine of a personal God, because that Church has behaved so inhumanely in the past 2000 years. The fear of punishment makes the people march. Consider the hate the Church manifested against the Jews and then against the Muslims, the Crusades with their crimes, the burning stakes of the Inquisition, the tacit consent of Hitler's actions while the Jews and the Poles dug their own graves and were slaughtered. And Hitler is said to have been an altar boy! The truly religious man has no fear of life and no fear of death—and certainly no blind faith; his faith must be in his conscience. . . I am therefore against all organized religion. Too often in history, men have followed the cry of battle rather than the cry of truth. "" When Hermanns asked him ""Isn't it only human to move along the line of least resistance? 
"", Einstein responded ""Yes. It is indeed human, as proved by Cardinal Pacelli, who was behind the Concordat with Hitler. Since when can one make a pact with Christ and Satan at the same time? And he is now the Pope! The moment I hear the word 'religion', my hair stands on end. The Church has always sold itself to those in power, and agreed to any bargain in return for immunity. It would have been fine if the spirit of religion had guided the Church; instead, the Church determined the spirit of religion. Churchmen through the ages have fought political and institutional corruption very little, so long as their own sanctity and church property were preserved. """,Misattributed to Albert Einstein,"Attributed in ""The Conflict Between Church And State In The Third Reich"", by S. Parkes Cadman, La Crosse Tribune and Leader-Press (28 October 1934), viewable online on p. 9 of the issue here (double-click the page to zoom). The quote is preceded by ""In this connection it is worth quoting in free translation a statement made by Professor Einstein last year to one of my colleagues who has been prominently identified with the Protestant church in its contacts with Germany. "" [Emphasis added. ] While based on something that Einstein said, Einstein himself stated that the quote was not an accurate record of his words or opinion. After the quote appeared in Time magazine (23 December 1940), p. 38, a minister in Harbor Springs, Michigan wrote to Einstein to check if the quote was real. Einstein wrote back ""It is true that I made a statement which corresponds approximately with the text you quoted. I made this statement during the first years of the Nazi-Regime — much earlier than 1940 — and my expressions were a little more moderate. "" (March 1943) In a later letter to Rev. 
Cornelius Greenway of Brooklyn, who asked if Einstein would write out the statement in his own hand, Einstein was more vehement in his repudiation of the statement (14 November 1950): The wording of the statement you have quoted is not my own. Shortly after Hitler came to power in Germany I had an oral conversation with a newspaper man about these matters. Since then my remarks have been elaborated and exaggerated nearly beyond recognition. I cannot in good conscience write down the statement you sent me as my own. The matter is all the more embarrassing to me because I, like yourself, I am predominantly critical concerning the activities, and especially the political activities, through history of the official clergy. Thus, my former statement, even if reduced to my actual words (which I do not remember in detail) gives a wrong impression of my general attitude.",,4614
21986,"In those days I had seen little further than the old school of political economists into the possibilities of fundamental improvement in social arrangements. Private property, as now understood, and inheritance, appeared to me, as to them, the dernier mot of legislation: and I looked no further than to mitigating the inequalities consequent on these institutions, by getting rid of primogeniture and entails. The notion that it was possible to go further than this in removing the injustice—for injustice it is, whether admitting of a complete remedy or not—involved in the fact that some are born to riches and the vast majority to poverty, I then reckoned chimerical, and only hoped that by universal education, leading to voluntary restraint on population, the portion of the poor might be made more tolerable. In short, I was a democrat, but not the least of a Socialist. We were now much less democrats than I had been, because so long as education continues to be so wretchedly imperfect, we dreaded the ignorance and especially the selfishness and brutality of the mass: but our ideal of ultimate improvement went far beyond Democracy, and would class us decidedly under the general designation of Socialists. 
While we repudiated with the greatest energy that tyranny of society over the individual which most Socialistic systems are supposed to involve, we yet looked forward to a time when society will no longer be divided into the idle and the industrious; when the rule that they who do not work shall not eat, will be applied not to paupers only, but impartially to all; when the division of the produce of labour, instead of depending, as in so great a degree it now does, on the accident of birth, will be made by concert on an acknowledged principle of justice; and when it will no longer either be, or be thought to be, impossible for human beings to exert themselves strenuously in procuring benefits which are not to be exclusively their own, but to be shared with the society they belong to. The social problem of the future we considered to be, how to unite the greatest individual liberty of action, with a common ownership in the raw material of the globe, and an equal participation of all in the benefits of combined labour. We had not the presumption to suppose that we could already foresee, by what precise form of institutions these objects could most effectually be attained, or at how near or how distant a period they would become practicable. We saw clearly that to render any such social transformation either possible or desirable, an equivalent change of character must take place both in the uncultivated herd who now compose the labouring masses, and in the immense majority of their employers. Both these classes must learn by practice to labour and combine for generous, or at all events for public and social purposes, and not, as hitherto, solely for narrowly interested ones. But the capacity to do this has always existed in mankind, and is not, nor is ever likely to be, extinct. Education, habit, and the cultivation of the sentiments, will make a common man dig or weave for his country, as readily as fight for his country. 
True enough, it is only by slow degrees, and a system of culture prolonged through successive generations, that men in general can be brought up to this point. But the hindrance is not in the essential constitution of human nature. Interest in the common good is at present so weak a motive in the generality not because it can never be otherwise, but because the mind is not accustomed to dwell on it as it dwells from morning till night on things which tend only to personal advantage. When called into activity, as only self-interest now is, by the daily course of life, and spurred from behind by the love of distinction and the fear of shame, it is capable of producing, even in common men, the most strenuous exertions as well as the most heroic sacrifices. The deep-rooted selfishness which forms the general character of the existing state of society, is so deeply rooted, only because the whole course of existing institutions tends to foster it; modern institutions in some respects more than ancient, since the occasions on which the individual is called on to do anything for the public without receiving its pay, are far less frequent in modern life, than the smaller commonwealths of antiquity.",John Stuart Mill,(pp. 230-233),Ch. 7: General View of the Remainder of My Life.,4382


The first entry is a court trial transcript, which explains why it is so long.
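As a rough sketch of how such unusually long entries could be flagged instead of spotted by eye: the snippet below computes a character-length column and keeps the rows above an upper quantile. The toy DataFrame, the `quote_length` column name, and the 0.75 quantile cutoff are all assumptions made for illustration; they are not taken from the notebook's actual data or earlier cells.

```python
import pandas as pd

# Toy data standing in for the quotes DataFrame (hypothetical values).
toy = pd.DataFrame({
    "quote": ["short", "a bit longer quote", None, "x" * 500],
    "author": ["A", "B", "C", "D"],
})

# Character length per quote; missing quotes are treated as length 0.
toy["quote_length"] = toy["quote"].fillna("").str.len()

# Keep rows whose length exceeds an upper quantile (0.75 is an arbitrary choice).
threshold = toy["quote_length"].quantile(0.75)
long_rows = toy[toy["quote_length"] > threshold]

print(long_rows[["author", "quote_length"]])
```

On the real dataset the same pattern could be applied to `df`, then the flagged rows inspected by hand to decide whether they are transcripts, speeches, or scraping artifacts.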

In [47]:
# Longest sources
df.nlargest(5, 'source_length')[['quote', 'author', 'source', 'source_length']]

Unnamed: 0,quote,author,source,source_length
419,We cannot solve the problems using the same kind of thinking we used when we created them,Disputed with Albert Einstein,"""Einstein's famous saying in Copenhagen"", as quoted in a FBIS Daily Report: East Europe (4 April 1995), p. 45 May have originated from Einstein's 25 May 1946 telegram quoted in this New York Times story, where he wrote ""The unleashed power of the atom has changed everything save our modes of thinking and we thus drift toward unparalleled catastrophe"", along with a later comment ""We need two hundred thousand dollars at once for a nation-wide campaign to let the people know that a new type of thinking is essential if mankind is to survive and move toward higher levels. "" The 1959 English translation of Hans Hellmut Kirst's The Seventh Day modified the two quotes and left out the intermediate part about funding for a nation-wide campaign: ""The unleashed power of the atom has changed everything except our ways of thinking. Thus we are drifting toward a catastrophe beyond comparison. We shall require a substantially new manner of thinking if mankind is to survive"" (the original German version from 1957, titled Keinner Kommt Davon, has the quote as ""Die entfesselte Macht des Atoms hat alles verändert, nur nicht unsere Denkweisen. Auf diese Weise gleiten wir einer Katastrophe ohnegleichen entgegen. Wir brauchen eine wesentlich neue Denkungsart, wenn die Menschheit am Leben bleiben soll. "") This version is quoted verbatim in later sources like p. 23 of a statement given by Dr. Charles E. Osgood to the US Committee on Foreign relations on 25 May 1966, and a March 1979 story on p. 82 of The Bulletin of Atomic Scientists. Some also gave shortened versions which omitted the reference to ""the unleashed power of the atom"", as in the 1967 English translation of Josué de Castro's 1961 book The Black Book of Hunger, which on p. 
4 attributed to Einstein the quote ""It has become essential that mankind formulate a new mode of thinking if it wishes to survive and to reach a higher level. "" Ram Dass' 1974 book The Only Dance There Is, which consisted of transcripts of talks he had given in 1970 and 1972, on p. 38 attributed to Einstein the quote ""The world that we have made as a result of the level of thinking we have done thus far creates problems that we cannot solve at the same level as the level we created them at. "" Ram Dass' speeches were generally given without notes so he may have just been paraphrasing one of the earlier versions of the quote, but some later sources repeated this exactly or with only a word or two different, like p. 291 of David Dellinger's 1975 book More Power Than We Know, or p. 136 of the 1977 book The End of the Road: A Citizen's Guide to Transportation Problem-Solving edited by Robert Golten et al. , or p. 47 of the 1986 edition of James L. Christian's book Philosophy: An Introduction to the Art of Wondering. Later authors gave shorter variants, like The 1986 Annual: Developing Human Resources which on p. 185 attributed to Einstein the statement ""Our thinking has created problems which cannot be solved by that same level of thinking"", or the 1988 book Take This Job and Love It by Dennis Jaffe and Cynthia Scott which on p. 60 attributed the quote ""the significant problems we have cannot be solved at the same level of thinking we were at when we created them"" (a nearly identical version with 'we face' instead of 'we have' can be found on p. 42 of Stephen Covey's 1989 book The Seven Habits of Highly Effective People), and the 1989 book Living in Love With Yourself by Barry Ellsworth, which on p. 27 attributed the quote ""the problems our world faces can never be solved by the same type of thinking that created them"", and Lynne Garnett's 1990 book Finding the Great Creative You: A Seven Step Adventure which on p. 
168 attributed the quote ""You can't solve a problem using the same kind of thinking that got you into the problem in the first place. "" An August 1993 post from alt. quotations on Usenet uses the wording ""We can't solve problems by using the same kind of thinking we used when we created them"" (the earliest example found which includes the phrase 'using the same kind of thinking we used when we created them' as in the 1995 FBIS quote above), though the author, Peter Capek, says this is just a rough memory of the quote being asked about; this exact wording was repeated in many later posts, with one of the earliest from October 1993 citing Peter Capek as the source.",4356
417,Two things are infinite: the universe and the human stupidity.,Disputed with Albert Einstein,"As discussed in this entry from The Quote Investigator, the earliest published attribution of a similar quote to Einstein seems to have been in Gestalt therapist Frederick S. Perls' 1969 book Gestalt Theory Verbatim, where he wrote on p. 33: ""As Albert Einstein once said to me: 'Two things are infinite: the universe and human stupidity. ' But what is much more widespread than the actual stupidity is the playing stupid, turning off your ear, not listening, not seeing. "" Perls also offered another variant in his 1972 book In and Out the Garbage Pail, where he mentioned a meeting with Einstein and on p. 52 quoted him saying: ""Two things are infinite, the universe and human stupidity, and I am not yet completely sure about the universe. "" However, Perls had given yet another variant of this quote in an earlier book, Ego, Hunger, and Aggression: a Revision of Freud's Theory and Method (originally published 1942, although the Quote Investigator only checked that the quote appeared in the 1947 edition), where he attributed it not to Einstein but to a ""great astronomer"", writing: ""As modern times promote hasty eating to a large extent, it is not surprising to learn that a great astronomer said: 'Two things are infinite, as far as we know – the universe and human stupidity. ' To-day we know that this statement is not quite correct. Einstein has proved that the universe is limited. "" So, the later attributions in 1969 and 1972 may have been a case of faulty memory, or of intentionally trying to increase the authority of the quote by attributing it to Einstein. The quote itself may be a variant of a similar quote attributed even earlier to the philosopher Ernest Renan, found for example in The Public: Volume 18 from 1915, which says on p. 1126: ""He quotes the saying of Renan: it isn't the stars that give him an idea of infinity; it is man's stupidity. 
"" (Other examples of similar attributions to Renan can be found on this Google Books search. ) Renan was French so this is presumably intended as a translation, but different sources give different versions of the supposed original French quote, such as ""La bêtise humaine est la seule chose qui donne une idée de l'infini"" (found for example in Réflexions sur la vie, 1895-1898 by Remy de Gourmont from 1903, p. 103, along with several other early sources as seen in this search) and ""Ce n'est pas l'immensité de la voûte étoilée qui peut donner le plus complétement l'idée de l'infini, mais bien la bêtise humaine! "" (found in Broad views, Volume 2 from 1904, p. 465). Since these variants have not been found in Renan's own writings, they may represent false attributions as well. They may also be variants of an even older saying; for example, the 1880 book Des vers by Guy de Maupassant includes on p. 9 a quote from a letter (dated February 19, 1880) by Gustave Flaubert where Flaubert writes ""Cependant, qui sait? La terre a des limites, mais la bêtise humaine est infinie! "" which translates to ""But who knows? The earth has its boundaries, but human stupidity is infinite! "" Similarly the 1887 book Melanges by Jules-Paul Tardivel includes on p. 273 a piece said to have been written in 1880 in which he writes ""Aujourd'hui je sais qu'il n'y a pas de limites à la bêtise humaine, qu'elle est infinie"" which translates to ""today I know that there is no limit to human stupidity, it is infinite. "" Variant: ""Only two things are infinite, the universe and human stupidity, and I'm not sure about the former. "" Earliest version located is in Technocracy digest: Issues 287–314 from 1988, p. 76. Translated to German as: ""Zwei Dinge sind unendlich: das Universum und die menschliche Dummheit. Aber beim Universum bin ich mir nicht ganz sicher. "" (Earliest version located - ""Zwei Dinge sind unendlich, das Universum und die menschliche Dummheit. . 
Und beim Universum bin ich mir noch keineswegs sicher"" - in Hans Askenasy: Sind wir alle Nazis? Zum Potential der Unmenschlichkeit, Campus Verlag Frankfurt/Main 1979, p. 153 books. google. )",4015
8363,"中国人民从中国解放区和国民党统治区，获得了明显的比较。难道还不明显吗？两条路线，人民战争的路线和反对人民战争的消极抗日的路线，其结果：一条是胜利的，即使处在中国解放区这种环境恶劣和毫无外援的地位；另一条是失败的，即使处在国民党统治区这种极端有利和取得外国接济的地位。国民党政府把自己的失败归咎于缺乏武器。但是试问：缺乏武器的是国民党的军队呢，还是解放区的军队？中国解放区的军队是中国军队中武器最缺乏的军队，他们只能从敌人手里夺取武器和在最恶劣条件下自己制造武器。国民党中央系军队的武器，不是比起地方系军队来要好得多吗？但是比起战斗力来，中央系却多数劣于地方系。国民党拥有广大的人力资源，但是在它的错误的兵役政策下，人力补充却极端困难。中国解放区处在被敌人分割和战斗频繁的情况之下，因为普遍实施了适合人民需要的民兵和自卫军制度，又防止了对于人力资源的滥用和浪费，人力动员却可以源源不竭。国民党拥有粮食丰富的广大地区，人民每年供给它七千万至一万万市担的粮食，但是大部分被经手人员中饱了，致使国民党的军队经常缺乏粮食，士兵饿得面黄肌瘦。中国解放区的主要部分隔在敌后，遭受敌人烧杀抢""三光""政策的摧残，其中有些是像陕北这样贫瘠的区域，但是却能用自己动手、发展农业生产的方法，很好地解决了粮食问题。国民党区域经济危机极端严重，工业大部分破产了，连布匹这样的日用品也要从美国运来。中国解放区却能用发展工业的方法，自己解决布匹和其他日用品的需要。在国民党区域，工人、农民、店员、公务人员、知识分子以及文化工作者，生活痛苦，达于极点。中国解放区的全体人民都有饭吃，有衣穿，有事做。利用抗战发国难财，官吏即商人，贪污成风，廉耻扫地，这是国民党区域的特色之一。艰苦奋斗，以身作则，工作之外，还要生产，奖励廉洁，禁绝贪污，这是中国解放区的特色之一。国民党区域剥夺人民的一切自由。中国解放区则给予人民以充分的自由。国民党统治者面前摆着这些反常的状况，怪谁呢？怪别人，还是怪他们自己呢？怪外国缺少援助，还是怪国民党政府的独裁统治和腐败无能呢？这难道还不明白吗？",Mao Zedong,"The Chinese people have come to see the sharp contrast between the Liberated Areas and the Kuomintang areas. Are not the facts clear enough? Here are two lines, the line of a people's war and the line of passive resistance, which is against a people's war; one leads to victory even in the difficult conditions in China's Liberated Areas with their total lack of outside aid, and the other leads to defeat even in the extremely favourable conditions in the Kuomintang areas with foreign aid available. The Kuomintang government attributes its failures to lack of arms. Yet one may ask, which of the two are short of arms, the Kuomintang troops or the troops of the Liberated Areas? Of all China's forces, those of the Liberated Areas lack arms most acutely, their only weapons being those they capture from the enemy or manufacture under the most adverse conditions. Is it not true that the forces directly under the Kuomintang central government are far better armed than the provincial troops? Yet in combat effectiveness most of the central forces are inferior to the provincial troops. 
The Kuomintang commands vast reserves of manpower, yet its wrong recruiting policy makes manpower replenishment very difficult. Though cut off from each other by the enemy and engaged in constant fighting, China's Liberated Areas are able to mobilize inexhaustible manpower because the militia and self-defence corps system, which is well-adapted to the needs of the people, is applied everywhere, and because misuse and waste of manpower are avoided. Although the Kuomintang controls vast areas abounding in grain and the people supply it with 70-100 million tan annually, its army is always short of food and its soldiers are emaciated because the greater part of the grain is embezzled by those through whose hands it passes. But although most of China's Liberated Areas, which are located in the enemy rear, have been devastated by the enemy's policy of ""burn all, kill all, loot all"", and although some regions like northern Shensi are very arid, we have successfully solved the grain problem through our own efforts by increasing agricultural production. The Kuomintang areas are facing a very grave economic crisis; most industries are bankrupt, and even such necessities as cloth have to be imported from the United States. But China's Liberated Areas are able to meet their own needs in cloth and other necessities through the development of industry. In the Kuomintang areas, the workers, peasants, shop assistants, government employees, intellectuals and cultural workers live in extreme misery. In the Liberated Areas all the people have food, clothing and work. It is characteristic of the Kuomintang areas that, exploiting the national crisis for profiteering purposes, officials have concurrently become traders and habitual grafters without any sense of shame or decency. 
It is characteristic of China's Liberated Areas that, setting an example of plain living and hard work, the cadres take part in production in addition to their regular duties; honesty is held in high esteem while graft is strictly prohibited. In the Kuomintang areas the people have no freedom at all. In China's Liberated Areas the people have full freedom. Who is to blame for all the anomalies which confront the Kuomintang rulers? Are others to blame, or they themselves? Are foreign countries to blame for not giving them enough aid, or are the Kuomintang government's dictatorial rule, corruption and incompetence to blame? Isn't the answer obvious?",3533
13790,Anyone who is not shocked by quantum theory has not understood it.,Disputed with Niels Bohr,"Heisenberg recounts a personal conversation he had with Pauli and Bohr in 1952 in which Bohr says, ""Those who are not shocked when they first come across quantum theory cannot possibly have understood it. "" Heisenberg, Werner, Physics and Beyond. (New York: Harper & Row, 1971) p. 206. Bohr said this sentence in a conversation with Werner Heisenberg, as quoted in: ""Der Teil und das Ganze. Gespräche im Umkreis der Atomphysik"". R. Piper & Co. , München, 1969, S. 280. DIE ZEIT 22. Aug. 1969. As quoted in Meeting the Universe Halfway (2007) by Karen Michelle Barad, p. 254, with the quote attributed to The Philosophical Writings of Niels Bohr, but with no page number or volume number given. David Mermin, on pages 186–187 of his book Boojums All the Way Through: Communicating Science in a Prosaic Age (1990) noted that he specifically looked for pithy quotes about quantum mechanics along these lines when reviewing the three volumes of The Philosophical Writings of Niels Bohr, but couldn't find any: Once I tried to teach some quantum mechanics to a class of law students, philosophers, and art historians. As an advertisement for the course I put together the most sensational quotations I could collect from the most authoritative practitioners of the subject. Heisenberg was a goldmine: ""The concept of the objective reality of the elementary particles has thus evaporated. . ""; ""the idea of an objective real world whose smallest parts exist objectively in the same sense as stones or trees exist, independently of whether or not we observe them. . is impossible. . "" Feynman did his part too: ""I think I can safely say that nobody understands quantum mechanics. "" But I failed to turn up anything comparable in the writings of Bohr. Others attributed spectacular remarks to him, but he seemed to take pains to avoid any hint of the dramatic in his own writings. 
You don't pack them into your classroom with ""The indivisibility of quantum phenomena finds its consequent expression in the circumstance that every definable subdivision would require a change of the experimental arrangement with the appearance of new individual phenomena, "" or ""the wider frame of complementarity directly expresses our position as regards the account of fundamental properties of matter presupposed in classical physical description but outside its scope. ""I was therefore on the lookout for nuggets when I sat down to review these three volumes – a reissue of Bohr's collected essays on the revolutionary epistemological character of the quantum theory and on the implications of that revolution for other scientific and non-scientific areas of endeavor (the originals first appeared in 1934, 1958, and 1963. ) But the most radical statement I could find in all three books was this: "". . physics is to be regarded not so much as the study of something a priori given, but rather as the development of methods for ordering and surveying human experience. "" No nuggets for the nonscientist. Variants: Those who are not shocked when they first come across quantum mechanics cannot possibly have understood it. Those who are not shocked when they first come across quantum theory cannot possibly have understood it. Anyone who is not shocked by quantum theory has not understood a single word. If you think you can talk about quantum theory without feeling dizzy, you haven't understood the first thing about it.",3403
9755,L'étymologie est une science où les voyelles ne font rien et les consonnes fort peu de chose.,Voltaire,"Etymology is a science in which vowels signify nothing at all, and consonants very little. Investigations of the comment include: Jan Noordegraaf (1997) ""Multatuli, Voltaire en de etymologie"" in Voorlopig verleden. Taalkundige plaatsbepalingen 1797-1960 (Münster: Nodus Publikationen) pp. 212-214 (). John Considine (January 2009) ""Les voyelles ne font rien, et les consonnes fort peu de chose"": On the history of Voltaire's supposed comment on etymology"" Historiographia Linguistica Volume 36, Issue 1, pp. 181-189; Garson O'Toole (25 March 2019) ""In Etymology Vowels Count for Nothing and Consonants for Very Little"" Quote Investigator From these, the earliest version found is already attributed to Voltaire: August Wilhelm von Schlegel (Part 1) of ""Review of 'Altdeutsche Wälder' vol. 1 by the Grimm brothers (Cassel 1813)"" (1815) Heidelberger Jahrbücher der Literatur no. 46 pp. 734-5 Mit solchen Allgemeinsätzen kann man Alles erkünsteln, und macht am Ende die Etymologie zu einer Wissenschaft, wobei, wie Voltaire sagt, die Vokale für gar nichts, die Konsonanten für wenig gerechnet werden. With such generalities one can artificialize everything, and in the end turn etymology into a science in which, as Voltaire says, the vowels are reckoned for nothing, the consonants for little. The earliest French version has singular voyelle / consonne: Anonymous (October 1833) ""Art. VII. -Grimm's Deutche Grammatik. Gottingen. 1822-1831. 3 vols"", The Quarterly Review (John Murray, London) volume 50 p. 
169: It is in works of this nature that Germany is pre-eminent among the European nations; and it is long since those who are interested in philological researches have made a more valuable acquisition, or one more fit to wipe out from their favourite study the reproach which has been somewhat speciously cast on it, that it is a science 'où la voyelle ne fait rien, et la consonne fort peu de chose. ' Friedrich Max Müller ascribes the same French to Voltaire in October 1851: ""Review of Franz Bopp, Comparative Grammar of the Sanskrit, Zend, Greek, Latin, Lithuanian, Gothic, German, and Sclavonic Languages, transl. by Edward Backhouse Eastwick"" Edinburgh Review v. 94, no. CXCII p. 298 Müller gives the plural voyelles / consonnes version in 1864 ""Lecture VI: On the Principles of Etymology"" Lectures on the Science of Language; Second Series (London: Longman, Green, Longman, Roberts, and Green) p. 238 Leonard Bloomfield (1933) Language (New York: Henry Holt) cites Muller 1864. (s. 1. 3 p. 6 and Notes p. 511) Pierre Guiraud (1979) [1972] L'étymologie Que sais-je? vol. 1122 (Paris: Presses Universitaires de France) 4th ed. p. 24 says that Voltaire's quote specifically refers to the etymologies in Gilles Ménage's Les Origines de la langue françoise (1650; expanded in 1670 as Dictionnaire Etymologique). Guiraud prepends the etymology quote to an actual quote from Voltaire's Histoire de l'empire de Russie sous Pierre le Grand. . Il est donc incontestable que l'empereur Yu prit son nom de Menés, roi d'Égypte, et l'empereur Ki est évidemment le roi Atoës en changeant k en a et i en toës. :. . which mocks Joseph de Guignes (1760) Mémoire dans lequel on prouve que les Chinois sont une colonie égyptienne.",3222


Extracting the useful information from these long sources and quotes in order to trim them is very complex. I think we should drop them.

One more operation we could perform during preprocessing: for rows whose source is NaN, check whether the quote contains the `:*` pattern, which marks a continuation of the quote. Sometimes the continuation contains the actual source, as in the first of the longest quotes above. That particular quote is not useful in our case, but we should at least try to split the quote at that pattern and, if anything follows it, move it into the source.
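The split described above could be sketched roughly as follows. This is only an illustration on a made-up three-row frame: the helper names `split_trailing_source` and `fill_missing_sources` are hypothetical, and it assumes the marker appears literally as `:*` and that the text after the last marker is the source.

```python
import numpy as np
import pandas as pd

MARKER = ":*"  # assumed literal form of the continuation marker


def split_trailing_source(quote):
    """Split a quote at the last ':*'; return (quote_part, source_part or None)."""
    if not isinstance(quote, str) or MARKER not in quote:
        return quote, None
    head, _, tail = quote.rpartition(MARKER)
    head, tail = head.strip(), tail.strip()
    if not head or not tail:
        # Nothing usable on one side of the marker: leave the quote untouched
        return quote, None
    return head, tail


def fill_missing_sources(frame):
    """Only touch rows whose source is missing; existing sources are kept."""
    out = frame.copy()
    for idx in out.index[out["source"].isna()]:
        quote_part, source_part = split_trailing_source(out.at[idx, "quote"])
        if source_part is not None:
            out.at[idx, "quote"] = quote_part
            out.at[idx, "source"] = source_part
    return out


# Hypothetical demo rows, not taken from the dataset
demo = pd.DataFrame({
    "quote": [
        "Some quoted text :* Example Book (1900), p. 1",
        "No marker here",
        "Another :* Kept Source",
    ],
    "source": [np.nan, np.nan, "Existing source"],
})
filled = fill_missing_sources(demo)
print(filled)
```

Splitting at the last marker rather than the first keeps any earlier continuation lines inside the quote, which matches the idea of treating only the trailing part as a candidate source.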

In [49]:
# Shortest quotes
df.nsmallest(5, 'quote_length')[['quote' , 'author' , 'source', 'quote_length']]

Unnamed: 0,quote,author,source,quote_length
579,…,Wikiquote: Templates,,1
4768,.,Anonymous,"Attributed to Joseph P. Kennedy Sr. in J. H. Cutler, Honey Fitz (1962), p. 291; also attributed to Knute Rockne, and others",1
5387,',Aristotle,"Every realm of nature is marvellous: and as Heraclitus, when the strangers who came to visit him found him warming himself at the furnace in the kitchen and hesitated to go in, is reported to have bidden them not to be afraid to enter, as even in that kitchen divinities were present, so we should venture on the study of every kind of animal without distaste; for each and all will reveal to us something natural and something beautiful. Book I, Part 5",1
5436,',Aristotle,"Virtue then is a settled disposition of the mind as regards the choice of actions and feelings, consisting essentially in the observance of the mean relative to us, this being determined by principle, that is, as the prudent man would determine it. And it is a mean state between two vices, one of excess and one of defect. Book II, 1106b. 28-1107a. 3. Rackham, . On Golden mean (philosophy)#Aristotle.",1
5866,',Cicero,"But of what immense worth is it for the soul to be with itself, to live, as the phrase is, with itself, discharged from the service of lust, ambition, strife, enmities, desires of every kind! If one has some provision laid up, as it were, of study and learning, nothing is more enjoyable than the leisure of old age. XIV, 49 (Latin, Peabody, Falconer) ""'"" translated to English as ""leisure of old age"" (Peabody) or ""leisured old age"" (Falconer), translated to Japanese as "" (Yoshida 1950). Alternate translation (Falconer): But how blessed it is for the soul, after having, as it were, finished its campaigns of lust and ambition, of strife and enmity and of all the passions, to return within itself, and, as the saying is, ""to live apart""! And indeed if it has any provender, so to speak, of study and learning, nothing is more enjoyable than a leisured old age.",1


In [53]:
# Shortest sources
df.nsmallest(5, 'source_length')[['quote' , 'author' , 'source', 'source_length', 'heading_context']]

Unnamed: 0,quote,author,source,source_length,heading_context
2739,This is a simple story of a battle; such a tale as may be told by a soldier who is no writer to a reader who is no soldier.,Ambrose Bierce,I,1,What I Saw At Shiloh (1881)
2740,"An army's bravest men are its cowards. The death which they would not meet at the hands of the enemy they will meet at the hands of their officers, with never a flinching.",Ambrose Bierce,V,1,What I Saw At Shiloh (1881)
2741,"Hidden in hollows and behind clumps of rank brambles were large tents, dimly lighted with candles, but looking comfortable. The kind of comfort they supplied was indicated by pairs of men entering and reappearing, bearing litters; by low moans from within and by long rows of dead with covered faces outside. These tents were constantly receiving the wounded, yet were never full; they were continually ejecting the dead, yet were never empty. It was as if the helpless had been carried in and murdered, that they might not hamper those whose business it was to fall to-morrow.",Ambrose Bierce,V,1,What I Saw At Shiloh (1881)
6011,"Let the superior man never fail reverentially to order his own conduct, and let him be respectful to others and observant of propriety: —then all within the four seas, all men are brothers. What has the superior man to do with being distressed because he has no brothers?",Confucius,V,1,Analects
8280,"When you talk with famous scholars, the best thing is to pretend that occasionally you do not quite understand them. If you understand too little, you will be despised; if you understand too much, you will be disliked; if you just fail occasionally to understand them you will suit each other very well.",Lu Xun,2,1,"""The Epigrams of Lusin"""


* The problem with the length-1 quotes is that the original quote is in another language, so only the full stop (or a stray quotation mark) was scraped, and everything that follows was scraped into the source instead. This is solvable.
* The same applies to the length-1 sources: the source is the chapter and the heading context is the book, so the two together are informative.
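As a rough sketch of the second point, a very short source can be combined with its heading context into one readable reference. The column name `full_source`, the helper `combine_source` and the length threshold are assumptions made for illustration; the demo rows only mimic the shape of the rows shown above.

```python
import pandas as pd

# Demo rows shaped like the short-source examples above
demo = pd.DataFrame({
    "source": ["V", "2", "A long, self-contained source"],
    "heading_context": ["What I Saw At Shiloh (1881)", None, "Analects"],
})


def combine_source(row, max_len=5):
    """Prefix a short source (chapter, page) with its heading context (book)."""
    source, heading = row["source"], row["heading_context"]
    if isinstance(source, str) and len(source) < max_len and isinstance(heading, str):
        return f"{heading}, {source}"
    # Long sources, or short ones without a heading, are left unchanged
    return source


demo["full_source"] = demo.apply(combine_source, axis=1)
print(demo["full_source"].tolist())
```

Short sources that have no heading context stay as they are, so they can still be picked up by the later check for uninformative sources.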

Let's see how many quotes are shorter than 5 characters and how many sources are shorter than 5 characters.

In [54]:
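# Count quotes shorter than 5 characters
# (str(NaN) is 'nan', length 3, so missing values are counted as short too)
short_quotes_count = (df['quote'].apply(lambda x: len(str(x))) < 5).sum()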

# Count sources shorter than 5 characters
short_sources_count = (df['source'].apply(lambda x: len(str(x))) < 5).sum()

print(f"Quotes shorter than 5 chars: {short_quotes_count}")
print(f"Sources shorter than 5 chars: {short_sources_count}")

Quotes shorter than 5 chars: 44
Sources shorter than 5 chars: 7336
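Note that `len(str(x))` turns a missing value into the string `'nan'` (length 3), so the 7,336 figure includes the 7,086 missing sources reported by `isna().sum()` earlier. A minimal sketch of a NaN-aware count on a made-up frame, using `Series.str.len()`, which leaves missing values as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical demo column: two short sources, one long one, two missing
demo = pd.DataFrame({"source": ["V", np.nan, "p. 11, Example Book", "II", np.nan]})

# Naive count: NaN becomes the string 'nan' and is counted as short
naive_short = (demo["source"].apply(lambda x: len(str(x))) < 5).sum()

# NaN-aware count: .str.len() yields NaN for missing values, and NaN < 5 is False
true_short = (demo["source"].str.len() < 5).sum()
missing = demo["source"].isna().sum()

print("naive:", naive_short, "short and present:", true_short, "missing:", missing)
```

Reporting the two numbers separately makes it clear how many sources are genuinely short and how many are simply absent.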


In [55]:
short_quotes_df = df[df['quote'].apply(lambda x: len(str(x)) < 5)]
print(short_quotes_df[['quote' , 'author' , 'source', 'heading_context']])

      quote                author  \
579       …  Wikiquote: Templates   
583     . .  Wikiquote: Templates   
4768      .             Anonymous   
5387      '             Aristotle   
5436      '             Aristotle   
5866      '                Cicero   
6019      '             Confucius   
7153    FAQ  Wikiquote: Utilities   
7155   Logo  Wikiquote: Utilities   
13554  ". "         Misquotations   
13703   NaN      Chinese proverbs   
13704   NaN      Chinese proverbs   
13707     ,      Chinese proverbs   
13708   NaN      Chinese proverbs   
13709   NaN      Chinese proverbs   
13710     ，      Chinese proverbs   
13712   NaN      Chinese proverbs   
13713   NaN      Chinese proverbs   
13714   NaN      Chinese proverbs   
13715   NaN      Chinese proverbs   
13716     一      Chinese proverbs   
13717   NaN      Chinese proverbs   
13718   NaN      Chinese proverbs   
13719     ，      Chinese proverbs   
13720   NaN      Chinese proverbs   
13721   NaN      Chinese proverbs   

Most of them are meaningful, so there is no need to delete them all. We should certainly drop the rows whose author contains "Wikiquote"; the others need to be checked.

Now let's see the number of quotes for which the source length is less than 10 and the heading context is NaN.

In [64]:
mask = df['source'].apply(lambda x: len(str(x)) < 10) & df['heading_context'].isna()

count = mask.sum()
print("Number of quotes:", count)

# If you also want to see them:
print(df[mask][['quote', 'source', 'heading_context']])

Number of quotes: 777
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      quote  \
421                                                                                                                                                                                                                                

In [65]:
df[mask]['source'].unique()

array([nan, 'Couplet', 'Scene i', 'Scene iv', 'Scene ii', 'Scene I',
       'Scene III', 'Scene VII', 'Lennox', 'Scene IV', 'Scene II',
       'Scene iii', 'scene i', 'Prospero', 'Scene VI', "'.", 'Ibid.',
       'Scene xiv', 'p. 11', 'p. 12', 'p. 13', 'p. 15', 'p. 17', 'P. 19',
       'p. 28', 'p. 29', 'p. 32', 'p. 50', 'p. 50-51', 'p. 52', 'p. 54',
       'p. 58', 'p. 59', 'Cited in', '17.', 'Chapter 1', 'Chapter 2',
       'Chapter 3', 'Chapter 4', 'Chapter 5', 'Chapter 7', 'Chapter 9',
       'Epilogue', '.', '(2006).', 'Minotaur', 'Cicero', '{{cite',
       'Björk', 'proverb', ', p. xcv', '<', ', p. 70', ', p. 71',
       ', p. 97', ', p. 125', ', p. 115', ', p. 105', ', p. 112',
       ', p. 447', ', p. 588', ', p. 788', ', p. 657', ', p. 698',
       ', p. 463', ', p. 171', ', p. 602', ', p. 264', ', p. 512',
       ', p. 709', ', p. 723', ', p. 752', ', p. 734', ',', 'Daniel'],
      dtype=object)

There are many of these, so we need to preprocess them too. One solution is to search for the source inside the quote.
* Use the same technique of splitting at the last `:*` encountered and saving the remainder as the source.
* We will do that in preprocessing.

In [71]:
quote_len_99 = df['quote_length'].quantile(0.99)
source_len_99 = df['source_length'].quantile(0.99)

print("99th percentile quote length:", quote_len_99)
print("99th percentile source length:", source_len_99)

99th percentile quote length: 1784.0199999999968
99th percentile source length: 751.0


We need to deal with these very long quotes and sources as well.
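One possible way to handle the long tail, sketched here on a made-up frame, is to flag rows above the 99th percentile and decide afterwards whether to drop or trim them. Using the 99th percentile as the cutoff is an assumption for illustration, not a decision made in this notebook.

```python
import pandas as pd

# Hypothetical demo: four ordinary quotes and one extremely long one
demo = pd.DataFrame({"quote": ["x" * n for n in [10, 20, 30, 40, 5000]]})
demo["quote_length"] = demo["quote"].str.len()

# Cutoff at the 99th percentile of the length distribution
cutoff = demo["quote_length"].quantile(0.99)
too_long = demo["quote_length"] > cutoff

# Keep the flagged rows aside for inspection instead of deleting them blindly
outliers = demo[too_long]
kept = demo[~too_long].reset_index(drop=True)
print(f"cutoff={cutoff:.1f}, kept={len(kept)}, flagged={len(outliers)}")
```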

We also have to decide whether to create a node for each of the four features, or only for the quote with all the others stored as properties, and which of the two gives better speed and lower memory use.

Let's see what percentage of the sources start with a translation label.

In [93]:
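# The alternation must not end with an empty branch ("...|)"): an empty branch followed by \b
# would match every source that starts with a word character.
pattern = re.compile(r'^\s*(translation|translations|translated|transliteration|transliterated|english equivalent|meaning|literal translation|pinyin:|english:|Original Latin:)\b', re.IGNORECASE)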

# check which rows match
df['starts_with_translation'] = df['source'].fillna("").apply(lambda x: bool(pattern.match(str(x))))

# calculate percentage
percentage = df['starts_with_translation'].mean() * 100

print(f"{percentage:.2f}% of sources start with 'translation', 'transliteration', or a variation")


2.11% of sources start with 'translation', 'transliteration', or a variation


This is a good number (2.11% of 30,000 rows is roughly 630). How can we utilise it?

In [94]:
df[df['starts_with_translation']][['quote', 'source','author','heading_context']].head(10)

Unnamed: 0,quote,source,author,heading_context
244,"Einer, der nur Zeitungen liest und, wenn's hochkommt, Bücher zeitgenössischer Autoren, kommt mir vor wie ein hochgradig Kurzsichtiger, der es verschmäht, Augengläser zu tragen. Er ist völlig abhängig von den vorurteilen und Moden seiner Zeit, denn er bekommt nichts anderes zu sehen und zu hören. Und was einer selbständig denkt ohne Anlehnung an das Denken und Erleben anderer, ist auch im besten Falle Ziemlich ärmlich und monoton.","Translation: Somebody who reads only newspapers and at best books of contemporary authors appears to me like an extremely near-sighted person who scorns eyeglasses. He is completely dependent on the prejudices and fashions of his times, since he never gets to see or hear anything else. And what a person thinks on his own, without being stimulated by the thoughts and experiences of other people, is, similarly, even in the best case rather paltry and monotonous. Article in Der Jungkaufmann, April 1952, Einstein Archives 28-972",Albert Einstein,1950s
565,Non-English quote.,"Translation: English Translation Source: Chapter xx, sentence xx or Act xx, Scene xx. Optional clarifications, notes on context, etc.",Wikiquote: Templates,
586,Foreign language quote.,Translation: English translation Author and source,Wikiquote: Templates,
601,"A diabolo, qui est simia dei.","Translation: ""From the devil, who is a monkey god. "" English equivalent: Where god has a church the devil will have his chapel. ""Wherever God erects a house of prayer, The Devil always builds a chapel there: And 'twill be found, upon examination, The latter has the largest congregation. "" ""where there's good there's also Evil"" Daniel Defoe, The True-Born Englishman (1701) Source for proverb:",Latin proverbs,
604,Acquirit qui tuetur.,English equivalent: Sparing is the first gaining.,Latin proverbs,
607,Aegrescit medendo.,English equivalent: The remedy is often worse than the disease.,Latin proverbs,
608,"Aegroto dum anima est, spes est.",English equivalent: As long as there is life there is hope.,Latin proverbs,
609,Aeque pars ligni curvi ac recti valet igni.,English equivalent: Crooked logs make straight fires.,Latin proverbs,
611,"Aliis si licet, tibi non licet.","Translation: If others are allowed to, that does not mean you are. (see also quod licet Iovi, non licet bovi)",Latin proverbs,
612,"An nescis, mi fili, quantilla prudentia mundus regatur? (alternatively: regatur orbis)","Translation: Don't you know, my son, with how little wisdom the world is governed? (1583 – 1654), 1648 letter to son, who was involved in negotiating the Classical and foreign quotations, William Francis Henry King, 1889, p. 40, quote #300 Sometimes attributed to Cardinal Richelieu. Variant form due to John Selden",Latin proverbs,


What we can do is create another property, "translation", on the quote nodes and store these values there. Then we will see how many quotes are null or shorter than 5 characters.
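A minimal sketch of how such a `translation` column might be derived from the source is shown below. It only handles two of the labels from the pattern above, the helper name `extract_translation` is hypothetical, and it makes no attempt to separate the translation from any citation that follows it in the same string.

```python
import re
import pandas as pd

# Only two of the labels from the notebook's pattern, for illustration
LABEL_RE = re.compile(
    r"^\s*(translation|english equivalent)\s*:\s*(.*)$",
    re.IGNORECASE | re.DOTALL,
)


def extract_translation(source):
    """Return the text after a leading translation label, or None."""
    if not isinstance(source, str):
        return None
    match = LABEL_RE.match(source)
    return match.group(2).strip() if match else None


demo = pd.DataFrame({
    "source": [
        "English equivalent: As long as there is life there is hope.",
        "Book II, p. 3",
        None,
    ]
})
demo["translation"] = demo["source"].apply(extract_translation)
print(demo)
```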

The next step in the EDA is to check whether the quotes or the sources contain links.

In [78]:
url_pattern = re.compile(r'https?://\S+')

# Boolean columns for detection
df['quote_has_link'] = df['quote'].fillna("").apply(lambda x: bool(url_pattern.search(x)))
df['source_has_link'] = df['source'].fillna("").apply(lambda x: bool(url_pattern.search(x)))

# Summary counts
print("Quotes with links:", df['quote_has_link'].sum())
print("Sources with links:", df['source_has_link'].sum())

# Percentage
print("Quotes with links (%):", df['quote_has_link'].mean() * 100)
print("Sources with links (%):", df['source_has_link'].mean() * 100)

Quotes with links: 0
Sources with links: 0
Quotes with links (%): 0.0
Sources with links (%): 0.0


They don't have typical links because I took care of that while extracting the information. But there is one pattern I noticed which has not been taken care of:
* {{ cite book

In [91]:
pattern = r"\{\{\s*cite\s+book"
count = df['source'].fillna("").str.contains(pattern, case=False, regex=True).sum()

print(f"Number of sources containing '{{{{cite book}}}}': {count}")


Number of sources containing '{{cite book}}': 47


In [92]:
matches = df[df['source'].fillna("").str.contains(pattern, case=False, regex=True)]
print(matches[['quote', 'source']])

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        

There are 47 sources which contain {{cite book, but only the ones which end with it are meaningless. So we will delete only the rows whose source ends with {{cite book and leave the others.
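The rule above could be expressed by anchoring the same pattern to the end of the string. This is a sketch on made-up rows, assuming a truncated template shows up as a source that ends right after `{{cite book`.

```python
import pandas as pd

# Hypothetical rows: a truncated template, a complete template, and a missing source
demo = pd.DataFrame({
    "quote": ["a", "b", "c"],
    "source": [
        "Example Title {{cite book",
        "{{cite book |title=Example |year=1900}}",
        None,
    ],
})

# Same pattern as above, anchored with $ so only sources ending in the template match
ends_with_cite = demo["source"].fillna("").str.contains(
    r"\{\{\s*cite\s+book\s*$", case=False, regex=True
)
cleaned = demo[~ends_with_cite].reset_index(drop=True)
print(cleaned)
```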

Let's see if we have any duplicate quotes

In [100]:
# Find duplicates based on quote, source and author
duplicates = df[df.duplicated(subset=['quote', 'source', 'author'], keep=False)]
# pd.set_option('display.max_colwidth', None)
# See how many there are
print(f"Total duplicate rows (quote+source): {duplicates.shape[0]}")

# View some examples
duplicates.sort_values(by='quote').head(20)

Total duplicate rows (quote+source): 88


Unnamed: 0,quote,source,author,heading_context,quote_length,source_length,starts_with_translation,quote_has_link,source_has_link
23577,"""Well done"" is better than ""well said"".",,English proverbs,,39,3,False,False,False
24081,"""Well done"" is better than ""well said"".",,English proverbs,,39,3,False,False,False
566,(. . quote. . ),,Wikiquote: Templates,,15,3,False,False,False
567,(. . quote. . ),,Wikiquote: Templates,,15,3,False,False,False
568,(. . quote. . ),,Wikiquote: Templates,,15,3,False,False,False
569,(. . quote. . ),,Wikiquote: Templates,,15,3,False,False,False
23389,Accidents will happen in the best families. (19th century),"Citatboken, Bokförlaget Natur och Kultur, Stockholm, 1967, p. 187, ISBN 91-27-01681-1",English proverbs,,58,85,False,False,False
23659,Accidents will happen in the best families. (19th century),"Citatboken, Bokförlaget Natur och Kultur, Stockholm, 1967, p. 187, ISBN 91-27-01681-1",English proverbs,,58,85,False,False,False
22000,"After dinner Mr. Mill read us Shelley's Ode to Liberty & he got quite excited & moved over it rocking backwards & forwards & nearly choking with emotion; he said himself: ""it is almost too much for one. ""","Lord Amberley, journal entry (28 September 1870), quoted in Bertrand and Patricia Russell (eds. ), The Amberley Papers, Volume II (1937), p. 375",unknown,,204,144,False,False,False
25735,"After dinner Mr. Mill read us Shelley's Ode to Liberty & he got quite excited & moved over it rocking backwards & forwards & nearly choking with emotion; he said himself: ""it is almost too much for one. ""","Lord Amberley, journal entry (28 September 1870), quoted in Bertrand and Patricia Russell (eds. ), The Amberley Papers, Volume II (1937), p. 375",unknown,,204,144,False,False,False


There are multiple duplicates in the dataset. We have to remove them as well.
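One detail worth keeping in mind: `duplicated(..., keep=False)` flags every member of a duplicate group, so the 88 rows counted above are more than the number of rows that `drop_duplicates` would actually remove. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical frame: "q1" appears twice, "q2" three times, "q3" once
demo = pd.DataFrame({
    "quote": ["q1", "q1", "q2", "q2", "q2", "q3"],
    "source": ["s"] * 6,
    "author": ["a"] * 6,
})
keys = ["quote", "source", "author"]

# keep=False marks all members of each duplicate group
flagged = demo.duplicated(subset=keys, keep=False).sum()

# keep="first" marks only the extra copies, i.e. the rows that would be dropped
removable = demo.duplicated(subset=keys, keep="first").sum()

deduped = demo.drop_duplicates(subset=keys).reset_index(drop=True)
print(flagged, removable, len(deduped))
```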

Now I think the EDA is complete. Let's preprocess the dataframe.

### Preprocessing

Let's write down the steps in the preprocessing.
* Deal with the rows that have null quote values: either extract the quote from the source or remove them (most of them are Chinese proverbs).
* Delete the Wikiquote pages.
* Deal with length-1 quotes and sources.
* Deal with very long quotes and sources.
* For rows whose source is NaN, check whether the quote contains the `:*` pattern, which marks a continuation of the quote. Sometimes the continuation contains the actual source, as in the first of the longest quotes above. That quote is not useful in our case, but we should at least try to split the quote at that pattern and, if anything follows it, move it into the source.
* Check whether the source is only one of the short values returned by `df[mask]['source'].unique()` above (NaN, 'Couplet', 'Scene i', 'p. 11', 'Chapter 1', 'Ibid.', 'Cited in', ', p. 70', '{{cite', and so on) with no heading context or author. If these two are not present, the source by itself is not meaningful, so we have to deal with it. Alternatively, check whether the quote contains a source after the `:*` pattern and, if so, extract it from the quote and put it into the source.
* Decide whether to create a node for each of the four features, or only for the quote with all the others stored as properties, and which of the two gives better speed and lower memory use.
* 2.11% of sources start with 'translation', 'transliteration', or a variation. We can create another property, "translation", on the quote nodes and store these values there. Then we will see how many quotes are null or shorter than 5 characters.
* There are 47 sources which contain {{cite book, but only the ones which end with it are meaningless. So we will delete only the rows whose source ends with {{cite book and leave the others.
* There are 88 duplicate rows; they need to be deleted.

Remove HTML, Wikiquote authors, and some patterns from the quote and source text.

So we have two options for the Chinese proverbs:

* drop the rows for which the quote is NaN;
* swap the source and the quote where the quote is NaN, because the source contains the translation and is, in a way, also a quote.

We will decide in the preprocessing part.
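Both options can be sketched side by side on a made-up frame. The row contents are placeholders rather than real proverbs, and the second variant assumes that a missing quote with a present source always means the source holds the usable text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two with a missing quote, one complete
demo = pd.DataFrame({
    "quote": [np.nan, "Existing quote", np.nan],
    "source": ["Translation text A", "Existing source", "Translation text B"],
})

# Option 1: drop the rows whose quote is missing
dropped = demo.dropna(subset=["quote"]).reset_index(drop=True)

# Option 2: move the source into the quote where the quote is missing
swapped = demo.copy()
missing = swapped["quote"].isna() & swapped["source"].notna()
swapped.loc[missing, "quote"] = swapped.loc[missing, "source"]
swapped.loc[missing, "source"] = np.nan

print(len(dropped), swapped["quote"].tolist())
```

The second option keeps every row but leaves those rows without a source, so they would show up again in the missing-source counts.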

We should do something about authors such as "Albert Einstein", "Misattributed to Albert Einstein" and "Disputed with Albert Einstein", which appear as three different authors in the author list. This is entity fragmentation. We should decide whether the author is a node in the database or a property of the quote. If the author is a node, these three variations should be replaced with one author node, and "misattributed" and "disputed" should be properties of the quote, for better memory utilisation.
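A minimal sketch of that normalisation, assuming the only prefixes are "Misattributed to" and "Disputed with" (other variants may exist in the data). The function name `normalize_author` and the status labels are hypothetical.

```python
import re
import pandas as pd

# Assumed prefixes; extend this mapping if other variants turn up
STATUS_BY_PREFIX = {"misattributed to": "misattributed", "disputed with": "disputed"}
PREFIX_RE = re.compile(r"^\s*(misattributed to|disputed with)\s+(.+?)\s*$", re.IGNORECASE)


def normalize_author(author):
    """Return (canonical_author, attribution_status) for an author string."""
    match = PREFIX_RE.match(author)
    if match:
        return match.group(2), STATUS_BY_PREFIX[match.group(1).lower()]
    return author.strip(), "attributed"


demo = pd.DataFrame({
    "author": [
        "Albert Einstein",
        "Misattributed to Albert Einstein",
        "Disputed with Niels Bohr",
    ]
})
pairs = demo["author"].apply(normalize_author)
demo["author_name"] = [name for name, _ in pairs]
demo["attribution"] = [status for _, status in pairs]
print(demo)
```

With this split, the three Einstein variants collapse to one `author_name`, and the attribution status travels with the quote.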


Drop wikiquote templates


We will check the length-1 quote rows and the maximum lengths during preprocessing.

First of all let's drop the wikiquote template and duplicates

In [26]:
df = df[df["author"].ne("Wikiquote: Templates")].reset_index(drop=True)
df = df[df["author"].ne("Wikiquote: Requests")].reset_index(drop=True)
df = df[df["author"].ne("Wikiquote: Utilities")].reset_index(drop=True)
df = df.drop_duplicates(subset=["quote", "source", "heading_context"]).reset_index(drop=True)


Also, let's remove every entry that has a Wikiquote page as its author.

In [38]:
# Find rows where author starts with "Wikiquote:"
wikiquote_authors = df[df["author"].astype(str).str.startswith("Wikiquote:")]
wikiquote_authors


Unnamed: 0,quote,source,author,heading_context
3926,Famous speeches of historical value can be added to the speeches portal at Wikisource.,,Wikiquote: What Wikiquote is not,
3927,"Whole public domain books can be added to Wikisource or Wikibooks, see Wikisource Project: Annotations.",,Wikiquote: What Wikiquote is not,
3928,Generic deposits of published information may be appropriate at Wikisource.,,Wikiquote: What Wikiquote is not,
3929,"Book PDFs/DJVUs, pictures, sound files, and other forms of media that are in the public domain or under a libre license are accepted at Wikimedia Commons.",,Wikiquote: What Wikiquote is not,
3930,Wikipedia: What Wikipedia is not,,Wikiquote: What Wikiquote is not,
...,...,...,...,...
25713,Themebot - whether we should have an automatic theme generator. (abandoned concept),,Wikiquote: Issues,
25714,Logo - what the Wikiquote logo should be. (has been chosen) Wikiquote|,,Wikiquote: Issues,
26628,"""sourced quote""",,Wikiquote: Sourced and Unsourced sections,
26629,"""moved to Misattributed after finding original quote""",,Wikiquote: Sourced and Unsourced sections,


In [92]:
df = df[~df["author"].astype(str).str.startswith("Wikiquote:")].reset_index(drop=True)
df = df[~df["author"].astype(str).str.startswith("List of categories")].reset_index(drop=True)
# List of categories

In [93]:
df["quote"] = df["quote"].replace("NaN", np.nan)

# Regex: only spaces or punctuation (no letters/numbers)
only_special_pattern = r'^[^\w\u4e00-\u9fff]+$'  # excludes alphanumerics + Chinese chars

mask = (
    df["quote"].isna() | 
    df["quote"].astype(str).str.fullmatch(only_special_pattern)
)

# Filter
special_or_nan_quotes = df[mask]

print(f"Found {len(special_or_nan_quotes)} rows")
display(special_or_nan_quotes)

Found 32 rows


Unnamed: 0,quote,source,author,heading_context
4656,.,"Attributed to Joseph P. Kennedy Sr. in J. H. Cutler, Honey Fitz (1962), p. 291; also attributed to Knute Rockne, and others",Anonymous,
5275,',"Every realm of nature is marvellous: and as Heraclitus, when the strangers who came to visit him found him warming himself at the furnace in the kitchen and hesitated to go in, is reported to have bidden them not to be afraid to enter, as even in that kitchen divinities were present, so we should venture on the study of every kind of animal without distaste; for each and all will reveal to us something natural and something beautiful. Book I, Part 5",Aristotle,Parts of Animals
5324,',"Virtue then is a settled disposition of the mind as regards the choice of actions and feelings, consisting essentially in the observance of the mean relative to us, this being determined by principle, that is, as the prudent man would determine it. And it is a mean state between two vices, one of excess and one of defect. Book II, 1106b. 28-1107a. 3. Rackham, . On Golden mean (philosophy)#Aristotle.",Aristotle,Book II
5754,',"But of what immense worth is it for the soul to be with itself, to live, as the phrase is, with itself, discharged from the service of lust, ambition, strife, enmities, desires of every kind! If one has some provision laid up, as it were, of study and learning, nothing is more enjoyable than the leisure of old age. XIV, 49 (Latin, Peabody, Falconer) ""'"" translated to English as ""leisure of old age"" (Peabody) or ""leisured old age"" (Falconer), translated to Japanese as "" (Yoshida 1950). Alternate translation (Falconer): But how blessed it is for the soul, after having, as it were, finished its campaigns of lust and ambition, of strife and enmity and of all the passions, to return within itself, and, as the saying is, ""to live apart""! And indeed if it has any provender, so to speak, of study and learning, nothing is more enjoyable than a leisured old age.",Cicero,Cato Maior de Senectute – On Old Age (44 BC)
5907,',"The Master said, ""Hard is it to deal with him, who will stuff himself with food the whole day, without applying his mind to anything good! Are there not gamesters and chess players? To be one of these would still be better than doing nothing at all. "" Book XVII, Chapter XXII ""'"" means ""to gamble"" in Modern Chinese, and ""to play Go; to play chess"" in Classical Chinese (博弈, 博奕). It is translated as ""gamesters and chess players"" (Legge 1861), ""play at chequers"" (Lyall 1909), and ""checker or chess players"" (Soothill 1910).",Confucius,Analects
13404,""". ""","This is often erroneously assumed to be the quote of Ben Parker dating back to the original Spider-Man origin story as depicted in 1962's Amazing Fantasy #15. This statement appears as a caption of narration in the last panel of the story and was not a spoken line by any character in the story. In most retellings of Spider-Man's origin, including the 2002 film, the quote has been retconned (the alteration of previously established facts in the continuity of a fictional work) to depict Uncle Ben's final lecture to Peter Parker before Ben's tragic death and as the words that continue to drive Peter as Spider-Man. Also, the correct Amazing Fantasy quote is, ""With great power there must also come great responsibility. """,Misquotations,
13549,， ， ， ； 。,"Transliteration (pinyin): Bù wén bù ruò wén zhī, wén zhī bù ruò jiàn zhī, jiàn zhī bù ruò zhīzhī, zhīzhī bù ruò xíng zhī; xué zhìyú xíng zhī ér zhǐ yǐ. Traditional: 不聞不若聞之，聞之不若見之，見之不若知之，知之不若行之；學至於行之而止矣 Simplified: 不闻不若闻之，闻之不若见之，见之不若知之，知之不若行之；学至于行之而止矣 From Xun Zi (荀子 8. 儒效 23）.",Chinese proverbs,B
13553,,Transliteration (pinyin): Chángjiāng hòulàng tuī qiánlàng. Traditional: 長江後浪推前浪 Simplified: 长江后浪推前浪 Meaning: The energy of the old generation inspires the new.,Chinese proverbs,Ch
13554,,Transliteration (pinyin): Dú wàn juǎn shū bùrú xíng wànlǐ lù. Traditional: 讀萬卷書不如行萬裡路 Simplified: 读万卷书不如行万里路 Reading ten thousand books is not as useful as traveling ten thousand miles. English equivalent: An ounce of practice is worth a pound of theory.,Chinese proverbs,D
13557,",","Transliteration (pinyin): Fú wú chóng zhì, huòbùdānxíng. Traditional: 福無重至, 禍不單行 Simplified: 福无重至, 祸不单行 Fortune does not come twice. Misfortune does not come alone. Meaning: Good things will only come once. Bad things will always come in groups. English equivalent: Misery loves company. Meaning: Opportunities should not be taken for granted. A problem ignored is a problem doubled.",Chinese proverbs,F


In [94]:
# Missing values
print(df.isna().sum())

quote                16
source             6846
author                0
heading_context    9150
dtype: int64


We can see that the source field often holds several labelled values (pinyin transliteration, traditional and simplified forms, meaning, and so on). We can parse these into a dictionary and use one of the values as the quote when the quote itself is empty or meaningless.

In [95]:

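def parse_kv(text, keys=None):
    """Split labelled text such as 'Traditional: ... Simplified: ...' into a dict."""
    if not isinstance(text, str) or not text.strip():
        return {}
    if keys is None:
        keys = [
            "Transliteration (pinyin)", "Transliteration", "Pinyin",
            "Traditional", "Simplified", "Meaning", "Literal",
            "Translation", "Variant", "Notes", "Source", "who"
        ]
    # Match any known label followed by a colon, case-insensitively.
    # Note: a label ending in ")" such as "Transliteration (pinyin)" never
    # satisfies the trailing \b before ":", so that label is not captured.
    key_alt = "|".join(map(re.escape, keys))
    pat = re.compile(rf"(?i)\b({key_alt})\b\s*:", re.DOTALL)
    matches = list(pat.finditer(text))
    if not matches:
        return {}
    out = {}
    for i, m in enumerate(matches):
        key_raw = m.group(1)
        # The value runs from this label's colon to the start of the next label
        start = m.end()
        end = matches[i+1].start() if i+1 < len(matches) else len(text)
        val = text[start:end].strip()
        # Collapse whitespace and trim surrounding quotes and punctuation
        val = re.sub(r"\s+", " ", val).strip(" \t\n\r\"'，。；;:,-")
        # Repeated labels are joined with " | "
        if key_raw in out and val:
            out[key_raw] = f"{out[key_raw]} | {val}"
        elif val:
            out[key_raw] = val
    return out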
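It is worth checking which labels the pattern actually recognises, because the next cell takes the first parsed value as the quote. The sketch below rebuilds the same pattern with a shortened `keys` list and runs it on a sample string modelled on the proverb rows; both the shortened list and the sample are assumptions for illustration.

```python
import re

# Shortened label list, for illustration only
keys = ["Transliteration (pinyin)", "Traditional", "Simplified"]
key_alt = "|".join(map(re.escape, keys))
pat = re.compile(rf"(?i)\b({key_alt})\b\s*:")

# Sample text shaped like the Chinese-proverb source field
sample = (
    "Transliteration (pinyin): Fù zhài zǐ huán. "
    "Traditional: 父債子還 Simplified: 父债子还"
)

found = [m.group(1) for m in pat.finditer(sample)]
print(found)  # ['Traditional', 'Simplified']
```

The label ending in `)` is skipped because `\b` needs a word character on one side, and neither `)` nor `:` is one. For these rows the first parsed key is therefore `Traditional`, which matches the replacement quotes shown in the output further down.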



In [96]:
df_new = df.copy()

for idx in df_new[mask].index:
    kv = parse_kv(df_new.at[idx, "source"])
    if kv:
        # Take first key's value
        first_val = next(iter(kv.values()))
        df_new.at[idx, "quote"] = first_val

        # Remove that key/value from the source text
        first_key = next(iter(kv.keys()))
        pattern = re.compile(rf"{re.escape(first_key)}\s*:\s*{re.escape(first_val)}", re.IGNORECASE)
        df_new.at[idx, "source"] = pattern.sub("", df_new.at[idx, "source"], count=1).strip()

df_new[mask]

Unnamed: 0,quote,source,author,heading_context
4656,.,"Attributed to Joseph P. Kennedy Sr. in J. H. Cutler, Honey Fitz (1962), p. 291; also attributed to Knute Rockne, and others",Anonymous,
5275,',"Every realm of nature is marvellous: and as Heraclitus, when the strangers who came to visit him found him warming himself at the furnace in the kitchen and hesitated to go in, is reported to have bidden them not to be afraid to enter, as even in that kitchen divinities were present, so we should venture on the study of every kind of animal without distaste; for each and all will reveal to us something natural and something beautiful. Book I, Part 5",Aristotle,Parts of Animals
5324,',"Virtue then is a settled disposition of the mind as regards the choice of actions and feelings, consisting essentially in the observance of the mean relative to us, this being determined by principle, that is, as the prudent man would determine it. And it is a mean state between two vices, one of excess and one of defect. Book II, 1106b. 28-1107a. 3. Rackham, . On Golden mean (philosophy)#Aristotle.",Aristotle,Book II
5754,',"But of what immense worth is it for the soul to be with itself, to live, as the phrase is, with itself, discharged from the service of lust, ambition, strife, enmities, desires of every kind! If one has some provision laid up, as it were, of study and learning, nothing is more enjoyable than the leisure of old age. XIV, 49 (Latin, Peabody, Falconer) ""'"" translated to English as ""leisure of old age"" (Peabody) or ""leisured old age"" (Falconer), translated to Japanese as "" (Yoshida 1950). Alternate translation (Falconer): But how blessed it is for the soul, after having, as it were, finished its campaigns of lust and ambition, of strife and enmity and of all the passions, to return within itself, and, as the saying is, ""to live apart""! And indeed if it has any provender, so to speak, of study and learning, nothing is more enjoyable than a leisured old age.",Cicero,Cato Maior de Senectute – On Old Age (44 BC)
5907,',"The Master said, ""Hard is it to deal with him, who will stuff himself with food the whole day, without applying his mind to anything good! Are there not gamesters and chess players? To be one of these would still be better than doing nothing at all. "" Book XVII, Chapter XXII ""'"" means ""to gamble"" in Modern Chinese, and ""to play Go; to play chess"" in Classical Chinese (博弈, 博奕). It is translated as ""gamesters and chess players"" (Legge 1861), ""play at chequers"" (Lyall 1909), and ""checker or chess players"" (Soothill 1910).",Confucius,Analects
13404,""". ""","This is often erroneously assumed to be the quote of Ben Parker dating back to the original Spider-Man origin story as depicted in 1962's Amazing Fantasy #15. This statement appears as a caption of narration in the last panel of the story and was not a spoken line by any character in the story. In most retellings of Spider-Man's origin, including the 2002 film, the quote has been retconned (the alteration of previously established facts in the continuity of a fictional work) to depict Uncle Ben's final lecture to Peter Parker before Ben's tragic death and as the words that continue to drive Peter as Spider-Man. Also, the correct Amazing Fantasy quote is, ""With great power there must also come great responsibility. """,Misquotations,
13549,不聞不若聞之，聞之不若見之，見之不若知之，知之不若行之；學至於行之而止矣,"Transliteration (pinyin): Bù wén bù ruò wén zhī, wén zhī bù ruò jiàn zhī, jiàn zhī bù ruò zhīzhī, zhīzhī bù ruò xíng zhī; xué zhìyú xíng zhī ér zhǐ yǐ. Simplified: 不闻不若闻之，闻之不若见之，见之不若知之，知之不若行之；学至于行之而止矣 From Xun Zi (荀子 8. 儒效 23）.",Chinese proverbs,B
13553,長江後浪推前浪,Transliteration (pinyin): Chángjiāng hòulàng tuī qiánlàng. Simplified: 长江后浪推前浪 Meaning: The energy of the old generation inspires the new.,Chinese proverbs,Ch
13554,讀萬卷書不如行萬裡路,Transliteration (pinyin): Dú wàn juǎn shū bùrú xíng wànlǐ lù. Simplified: 读万卷书不如行万里路 Reading ten thousand books is not as useful as traveling ten thousand miles. English equivalent: An ounce of practice is worth a pound of theory.,Chinese proverbs,D
13557,"福無重至, 禍不單行","Transliteration (pinyin): Fú wú chóng zhì, huòbùdānxíng. Simplified: 福无重至, 祸不单行 Fortune does not come twice. Misfortune does not come alone. Meaning: Good things will only come once. Bad things will always come in groups. English equivalent: Misery loves company. Meaning: Opportunities should not be taken for granted. A problem ignored is a problem doubled.",Chinese proverbs,F


In [97]:
print(df_new.isna().sum())

quote                 0
source             6846
author                0
heading_context    9150
dtype: int64


Next, delete the quotes that consist only of punctuation.
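Before removing anything, it can help to see what the pattern treats as punctuation-only. A small sketch on hand-picked strings (the `samples` series is illustrative; the pattern is the one used in this notebook):

```python
import pandas as pd

only_punct_pattern = r'^[^\w\u4e00-\u9fff]+$'

# Hand-picked examples: four punctuation-only strings, then two real quotes
samples = pd.Series([".", "'", "， ， ； 。", '". "', "父債子還", "No."])

is_punct_only = samples.str.fullmatch(only_punct_pattern)
print(is_punct_only.tolist())  # [True, True, True, True, False, False]
```

In Python 3, `\w` on `str` patterns is Unicode-aware and already covers CJK ideographs, so the explicit `\u4e00-\u9fff` range is redundant but harmless.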

In [98]:
only_punct_pattern = r'^[^\w\u4e00-\u9fff]+$'

# Mask for "only punctuation"
mask_only_punct = df_new["quote"].astype(str).str.fullmatch(only_punct_pattern)

# Preview them before removal (a bare expression mid-cell is not displayed)
punct_only_rows = df_new[mask_only_punct]
display(punct_only_rows)

# Remove them
df_new = df_new[~mask_only_punct].reset_index(drop=True)

In [99]:
short_quotes_count_new = (df_new['quote'].astype(str).str.len() < 5).sum()

print(f"Quotes shorter than 5 chars: {short_quotes_count_new}")

short_quotes_df_new = df_new[df_new['quote'].astype(str).str.len() < 5]
short_quotes_df_new[['quote', 'author', 'source', 'heading_context']]

Quotes shorter than 5 chars: 10


Unnamed: 0,quote,author,source,heading_context
13552,父債子還,Chinese proverbs,"Transliteration (pinyin): Fù zhài zǐ huán. Simplified: 父债子还 Father's debt, son to give back. Meaning: The new generation can fix the mistakes made by previous ones. ""New generation can put right the mistakes of the old. "" ""To do the opposite of something is also a form of imitation. "" Georg Lichtenberg, The Waste Books, R. J. Hollingdale trans. (2000), D96.",F
13557,良藥苦口,Chinese proverbs,"Transliteration (pinyin): Liángyào kǔkǒu Simplified: 良药苦口 Good medicine tastes bitter. Meaning: What may be good for us later may be hard for us now. English equivalent: Bitter pills may have blessed effects. ""Present afflictions may tend to our future good. """,L
13560,一,Chinese proverbs,"Transliteration (pinyin): Ròu bāozi dǎ gǒu 一 qù bù huítóu. Traditional: 肉包子打狗一去不回頭 Simplified: 肉包子打狗一去不回头 To hit a dog with a meat-bun, so it leaves never turning around. Meaning: Punishment gives less incentive than a reward.",R
13588,光阴似箭,Chinese proverbs,English equivalent: Time flies like an arrow.,Z
14001,No.,Last words,"Who: Alexander Graham Bell, a Scottish-born scientist, inventor, engineer and innovator who is credited with patenting the first practical telephone. Note: While Alexander Graham Bell was dying, his deaf wife whispered to him, ""Don't leave me. "" Bell responded by signing the word, ""No. """,
14563,V-1.,Last words,"Who: Klaas Meurs Note: Klaas Meurs was the First Officer of KLM Flight 4805, which crashed on take-off on March 27th 1977, killing 583 people when it collided with a Pan Am Boeing 747, killing all 248 on Flight 4805 (including Meurs) and 335 on the Pan Am Aircraft. His last words of ""V-1"" where the Aviation terminology used when an Aircraft is going too fast to cancel the take-off and must take-off. Eight seconds later, both planes collided.",
14727,Yeah,Last words,,
14740,No.,Last words,"Who: Alfred Rosenberg, Nazi ideologist and minister Note: When asked if he had any last words before being executed by hanging.",
14877,Yes.,Last words,Who: Alice B. Toklas Note: Her response when asked if she wanted to die.,
14938,No.,Last words,"Who: H. K. ""Hank"" Williams, American country singer Note: In response to whether he wanted something to eat.",


But first, let's delete the rows where the author is "unknown" and both source and heading_context are missing.

In [100]:
mask_del = (
    df_new["author"].astype(str).str.strip().str.lower().eq("unknown") &
    (
        df_new["source"].isna() |
        df_new["source"].astype(str).str.fullmatch(only_punct_pattern)
    ) &
    df_new["heading_context"].isna()
)

# Preview before deletion
display(df_new[mask_del])

# Delete them
df_new = df_new[~mask_del].reset_index(drop=True)

Unnamed: 0,quote,source,author,heading_context
542,"His work revolved around three rules which apply to all science, our problems, and times:",,unknown,
543,": 1. Out of clutter, find simplicity;",,unknown,
544,: 2. From discord make harmony; and finally,,unknown,
2104,Russell is the most gifted Englishman alive.,,unknown,
2105,"Russell doesn't understand the importance of the past, or of tradition, and—he won't qualify.",,unknown,
2432,"""I wish I'd said that""",,unknown,
4447,"Plain language sounds purely objective. On the one hand, it has not the accent of mere vituperation, it is thoroughly dignified; and on the other, it is not the language of a person who is mainly concerned with wangling somebody into believing something. When Mr. Jefferson wrote that one of his associates in Washington's cabinet was ""a fool and a blabber, "" his words, taken in their context, make exactly the same impression of calm, disinterested and objective appraisal as if he had remarked that the man had black hair and brown eyes. Or again, while we are about it, let us examine the most extreme example of this sort of thing that I have so far found in English literature, which is Kent's opinion of Oswald, in King Lear: :: Kent. Fellow, I know thee. :: Osw. What dost thou know me for? :: Kent. A knave; a rascal; an eater of broken meats; a base, proud, shallow, beggarly, three-suited, hundred-pound, filthy, worsted-stocking knave; a lily-livered, action-taking whoreson, glass-gazing, & super-servicable, finical rogue; onetrunk-inheriting slave; one that wouldst be a bawd, in way of good service, and art nothing but the composition of a knave, beggar, coward, pandar, and the son and heir of a mongrel bitch. : Now, considering Kent's character and conduct, as shown throughout the play, I doubt very much that those lines should be taken as merely so much indecent blackguarding. . . an actor who ranted through them in the tone and accent of sheer violent diatribe would ruin his part. Frank Warrin cited those lines the other day, when he was telling me how much he would enjoy a revival of Lear, with our gifted friend Bill Parke cast for the part of Kent. He said, ""Can't you hear Bill's voice growing quieter and quieter, colder and colder, deadlier and deadlier, all the way through that passage? 
"" Angry as Kent is, and plain as his language is, his tone and manner must carry a strong suggestion of objectivity in order to keep fully up to the dramatist's conception of his role. Kent is not abusing Oswald; he is merely, as we say, ""telling him. "": * Albert Jay Nock, in ""Free Speech and Plain Language"" The Atlantic Monthly (January 1936)",,unknown,
4448,"Lear is a play [that] contains a great deal of veiled social criticism — but it is all uttered either by the Fool, by Edgar when he is pretending to be mad, or by Lear during his bouts of madness. In his sane moments Lear hardly ever makes an intelligent remark. :* George Orwell, in Lear",,unknown,
4638,"Caesar overtook his advanced guard at the river Rubicon, which formed the frontier between Gaul and Italy. Well aware how critical a decision confronted him, he turned to his staff, remarking: :: ""We may still draw back but, once across that little bridge, we shall have to fight it out"": As he stood, in two minds, an apparition of superhuman size and beauty was seen sitting on the river bank playing a reed pipe. A party of shepherds gathered around to listen and, when some of Caesar's men broke ranks to do the same, the apparition snatched a trumpet from one of them, ran down to the river, blew a thunderous blast, and crossed over. Caesar exclaimed: :: ""Let us accept this as a sign from the Gods, and follow where they beckon, in vengeance on our double-dealing enemies. The die is cast. "": He led his army to the farther bank, where he welcomed the tribunes of the people who had fled to him from Rome. Then he tearfully addressed the troops and, ripping open his tunic to expose his breast, begged them to stand faithfully by him. :* Suetonius, in The Twelve Caesars, as translated by Robert Graves (1957), ¶ 31-33: * Variant translations: :* He caught up with his cohorts at the River Rubicon, which was the boundary of his province, where he paused for a while, thinking over the magnitude of what he was planning, then, turning to his closest companions, he said: ""Even now we can still turn back. But once we have crossed that little bridge, everything must be decided by arms. "" As he paused, the following portent occurred. A being of splendid size and beauty suddenly appeared, sitting close by, and playing music on a reed. A large number of shepherds hurried to listen to him and even some of the soldiers left their posts to come, trumpeters among them. From one of these, the apparition seized a trumpet, leapt down to the river, and with a huge blast sounded the call to arms and crossed over to the other bank. 
Then said Caesar: ""Let us go where the gods have shown us the way and the injustice of our enemies calls us. The die is cast. "" And so the army crossed over and welcomed the tribunes of the plebs who had come over to them, having been expelled from Rome. Caesar addressed the sol- diers, appealing to their loyalty, with tears, and ripping the garments from his breast. :**As translated by Catherine Edwards (2000)",,unknown,
7040,"Colonel Gaston Bell: General McAuliffe refused a German surrender demand. You know what he said? :General George S. Patton: What? :Colonel Gaston Bell: ""Nuts! "": Patton: [laughing] Keep them moving, colonel. A man that eloquent has to be saved. :* Francis Ford Coppola and Edmund H. North, in Patton (1970), depicting Patton leading three divisions towards Bastogne.",,unknown,


In [101]:
# Mask: source length < 5 AND heading_context is NaN
mask_src_short_no_context = (
    df_new["source"].astype(str).str.len() < 5
) & (
    df_new["heading_context"].isna()
)

# Preview the rows
short_source_no_context = df_new[mask_src_short_no_context]
display(short_source_no_context)


Unnamed: 0,quote,source,author,heading_context
421,Stay away from negative people. They have a problem for every solution.,,Disputed with Albert Einstein,
1539,"Republicans approve of the American farmer, but they are willing to help him go broke. They stand four-square for the American home--but not for housing. They are strong for labor--but they are stronger for restricting labor's rights. They favor minimum wage--the smaller the minimum wage the better. They endorse educational opportunity for all--but they won't spend money for teachers or for schools. They think modern medical care and hospitals are fine--for people who can afford them. . .They think the American standard of living is a fine thing--so long as it doesn't spread to all the people. And they admire the Government of the United States so much that they would like to buy it. − Harry S. Truman, October 13, 1948, St. Paul, Minnesota, Radio Broadcast. https: //www. trumanlibrary. gov/library/public-papers/236/address-st-paul-municipal-auditoriumhttp: //www. presidency. ucsb. edu/ws/index. php? pid=13046https: //www. goodreads. com/quotes/97983-republicans-approve-of-the-american-farmer-but-they-are-willinghttps: //www. snopes. com/politics/quotes/trumangop. asphttps: //spydersden. wordpress. com/2014/11/22/ten-quotes-about-republicans-from-harry-truman/https: //www. nytimes. com/2017/11/24/opinion/republican-taxes-healthcare. html#permid=24954077",,Harry S. Truman,
1540,"If you want a friend in Washington, get a dog.",,Harry S. Truman,
2688,"For there be divers sorts of death—some wherein the body remaineth; and in some it vanisheth quite away with the spirit. This commonly occurreth only in solitude (such is God's will) and, none seeing the end, we say the man is lost, or gone on a long journey—which indeed he hath; but sometimes it hath happened in sight of many, as abundant testimony showeth. In one kind of death the spirit also dieth, and this it hath been known to do while yet the body was in vigor for many years. Sometimes, as is veritably attested, it dieth with the body, but after a season is raised up again in that place where the body did decay.",,Ambrose Bierce,
2689,"On every side of me stretched a bleak and desolate expanse of plain, covered with a tall overgrowth of sere grass, which rustled and whistled in the ​autumn wind with heaven knows what mysterious and disquieting suggestion. Protruded at long intervals above it, stood strangely shaped and somber-colored rocks, which seemed to have an understanding with one another and to exchange looks of uncomfortable significance, as if they had reared their heads to watch the issue of some foreseen event. A few blasted trees here and there appeared as leaders in this malevolent conspiracy of silent expectation.",,Ambrose Bierce,
...,...,...,...,...
27739,"In the end, more than freedom, they wanted security. They wanted a comfortable life, and they lost it all – security, comfort, and freedom. When the Athenians finally wanted not to give to society but for society to give to them, when the freedom they wished for most was freedom from responsibility, then Athens ceased to be free and was never free again. :* This quotation appeared in an article by Margaret Thatcher, ""The Moral Foundations of Society"" (Imprimis, March 1995), which was an edited version of a lecture Thatcher had given at Hillsdale College in November 1994. Here is the actual passage from Thatcher's article: :: [M]ore than they wanted freedom, the Athenians wanted security. Yet they lost everything—security, comfort, and freedom. This was because they wanted not to give to society, but for society to give to them. The freedom they were seeking was freedom from responsibility. It is no wonder, then, that they ceased to be free. In the modern world, we should recall the Athenians' dire fate whenever we confront demands for increased state paternalism. :: The italicized passage above originated with Thatcher. In characterizing the Athenians in the article she cited Sir Edward Gibbon, but she seems to have been paraphrasing statements in ""Athens' Failure, "" a chapter of classicist Edith Hamilton's book The Echo of Greece (1957), pp. 47–48).",,Misattributed to Edward Gibbon,
28030,"No sound of wheels or hoof-beat breaks The silence of the summer day, As by the loveliest of all lakes I while the idle hours away. I pace the leafy colonnade Where level branches of the plane Above me weave a roof of shade Impervious to the sun and rain. At times a sudden rush of air Flutters the lazy leaves o'erhead, And gleams of sunshine toss and flare Like torches down the path I tread. By Somariva's garden gate I make the marble stairs my seat, And hear the water, as I wait, Lapping the steps beneath my feet. The undulation sinks and swells Along the stony parapets, And far away the floating bells Tinkle upon the- fisher's nets. Silent and slow, by tower and town The freighted barges come and go, Their pendent shadows gliding down By town and tower submerged below. The hills sweep upward from the shore With villas scattered one by one Upon their wooded spurs, and lower Bellagio blazing in the sun. And dimly seen, a tangled mass Of walls and woods, of light and shade, Stands beckoning up the Stelvio Pass Varenna with its white cascade. I ask myself, Is this a dream? Will it all vanish into air-? Is there a land of such supreme And perfect beauty anywhere? Sweet vision! Do not fade away; Linger until my heart shall take Into itself the summer day, And all the, beauty of the lake. Linger until upon my brain Is stamped an image of the scene, Then fade into the air again, And be as if thou hadst not been.",,Henry Wadsworth Longfellow,
28348,"All the breath and the bloom of the year in the bag of one bee: All the wonder and wealth of the mine in the heart of one gem: In the core of one pearl all the shade and the shine of the sea: Breath and bloom, shade and shine, — wonder, wealth, and — how far above them —: : Truth, that's brighter than gem, :: Trust, that's purer than pearl, —: Brightest truth, purest trust in the universe, — all were for me: : In the kiss of one girl. :: * ""Summum Bonum"" (1889).",,Robert Browning,
28607,"""frank hearted maids of rocky Cumberland"" Wordsworths' approval of the locals following a night of dancing during his 1788 summer vacation away from Cambridge University. The Early Life of William Wordsworth. Emile Legouis. Pub Dent & Co. 1897. Page 100.",,William Wordsworth,


If the source is missing, try to extract it from the quote text.
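As a minimal sketch of the idea: many quotes carry their attribution after a `:*` marker, so splitting on that marker separates the quote from the candidate sources. The sample string below is made up for illustration; the regular expression is the one defined in the next cell.

```python
import re

# Split on ":*", ": *", "：*", repeated asterisks, and surrounding spaces
CONT_SPLIT_RE = re.compile(r'\s*[:：]\s*\*+\s*')

# Made-up example with one attribution and one longer editorial note
sample = (
    "Some quote text. :* First attribution, 1948"
    ": * A longer editorial note about the attribution."
)

parts = CONT_SPLIT_RE.split(sample)
new_quote = parts[0]
# Same selection rule as pick_candidate below: keep the longest trailing chunk
candidate = max(parts[1:], key=len)

print(parts)
print(candidate)
```

Choosing the longest chunk is a heuristic. When a long editorial note follows a short citation, the note wins, so the preview step matters before any values are overwritten.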

In [102]:
# re, numpy and pandas are already imported at the top of the notebook

# Treat "only punctuation/whitespace" as empty
ONLY_PUNCT_RE = r'^[^\w\u4e00-\u9fff]+$'

# split on variations like ":*", ": *", "：*", multiple asterisks, extra spaces
CONT_SPLIT_RE = re.compile(r'\s*[:：]\s*\*+\s*')

def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = re.sub(r"\s+", " ", s).strip()
    # trim common trailing punctuation clutter
    return s.strip(" \t\n\r\"'，。；;:,-")

def pick_candidate(chunks):
    """
    Choose the best 'source-like' chunk: the longest chunk that is
    neither empty nor punctuation-only. No minimum length is enforced.
    """
    cleaned = [clean_text(c) for c in chunks if isinstance(c, str)]
    # drop empty / punctuation-only
    cleaned = [c for c in cleaned if c and not re.fullmatch(ONLY_PUNCT_RE, c)]
    if not cleaned:
        return ""
    # choose the longest
    return max(cleaned, key=len)

# 1) Build a mask: source is NaN and quote contains a ':*' continuation
mask_continuation = df_new["source"].isna() & df_new["quote"].astype(str).str.contains(CONT_SPLIT_RE)

# 2) Preview what would change (no mutations yet)
preview = []
for idx in df_new[mask_continuation].index:
    q = df_new.at[idx, "quote"]
    parts = CONT_SPLIT_RE.split(q)
    if len(parts) < 2:
        continue
    # everything after the first chunk are candidates for source
    candidate = pick_candidate(parts[1:])
    if candidate:
        preview.append({
            "index": idx,
            "old_quote": q,
            "new_quote": clean_text(parts[0]),
            "new_source_candidate": candidate
        })

preview_df = pd.DataFrame(preview)
display(preview_df.head(20))  # <- review here
print(f"Rows with candidate source found: {len(preview_df)}")


Unnamed: 0,index,old_quote,new_quote,new_source_candidate
0,1412,". . WHEREAS, IN THE PAST, THE POINT OF DISAGREEMENT HAS BEEN BETWEEN DISSONANCE AND CONSONANCE, IT WILL BE, IN THE IMMEDIATE FUTURE, BETWEEN NOISE AND SO-CALLED MUSICAL SOUNDS. : THE PRESENT METHODS OF WRITING MUSIC, PRINCIPALLY THOSE WHICH EMPLOY HARMONY AND ITS REFERENCE TO PARTICULAR STEPS IN THE FIELD OF SOUND, WILL BE INADEQUATE FOR THE COMPOSER, WHO WILL BE FACED WITH THE ENTIRE FIELD OF SOUND. :** Quote of John Cage, in: 'The Future of Music: Credo' (1937); SILENCE; lectures and writings by Cage, John', Publisher Middletown, Conn. Wesleyan University Press, June 1961, CREDO/3",". . WHEREAS, IN THE PAST, THE POINT OF DISAGREEMENT HAS BEEN BETWEEN DISSONANCE AND CONSONANCE, IT WILL BE, IN THE IMMEDIATE FUTURE, BETWEEN NOISE AND SO-CALLED MUSICAL SOUNDS. : THE PRESENT METHODS OF WRITING MUSIC, PRINCIPALLY THOSE WHICH EMPLOY HARMONY AND ITS REFERENCE TO PARTICULAR STEPS IN THE FIELD OF SOUND, WILL BE INADEQUATE FOR THE COMPOSER, WHO WILL BE FACED WITH THE ENTIRE FIELD OF SOUND.","Quote of John Cage, in: 'The Future of Music: Credo' (1937); SILENCE; lectures and writings by Cage, John', Publisher Middletown, Conn. Wesleyan University Press, June 1961, CREDO/3"
1,1413,"The composer (organizer of sound) will be faced not only with the entire field of sound but also with the entire field of time. The 'frame' or fraction of a second, following established film technique, will probably be the basic unit in the measurement of time. No rhythm will be beyond the composer's reach. : NEW METHODS WILL BE DISCOVERED, BEARING A DEFINITE RELATION TO SCHOENBERG'S TWELVE-TONE SYSTEM: * In: 'The Future of Music: Credo' (1937); in: 'Silence: lectures and writings by Cage, John', Publisher Middletown, Conn. Wesleyan University Press, June 1961, 4/SILENCE","The composer (organizer of sound) will be faced not only with the entire field of sound but also with the entire field of time. The 'frame' or fraction of a second, following established film technique, will probably be the basic unit in the measurement of time. No rhythm will be beyond the composer's reach. : NEW METHODS WILL BE DISCOVERED, BEARING A DEFINITE RELATION TO SCHOENBERG'S TWELVE-TONE SYSTEM","In: 'The Future of Music: Credo' (1937); in: 'Silence: lectures and writings by Cage, John', Publisher Middletown, Conn. Wesleyan University Press, June 1961, 4/SILENCE"
2,3812,"Able was I ere I saw Elba. :* Credited to ""J. T. R. "" of Baltimore, 1848The Golden Rule and Gazette Of The Union, Saturday July 8 1848, page 30, article titled ""Ingenious arrangement of words""; https: //www. google. co. uk/books/edition/The_Golden_Rule_and_Odd_fellows_Family_C/hEg2AQAAMAAJ? hl=en&gbpv=1&dq=able%20was%20i%20ere%20i%20saw%20elba: * Of such attributions to Napoleon, there is little credence, as stated by William Irvine in Madam I'm Adam and Other Palindromes (1987): ""The well-known ABLE WAS I, ERE I SAW ELBA, for example, is conveniently attributed to Napoleon, whose knowledge of English wordplay was certainly questionable, at best. "" There is no mention of such a palindrome in O'Meara's own work, Napoleon in Exile: or, A Voice from St. Helena (1822).",Able was I ere I saw Elba.,"Of such attributions to Napoleon, there is little credence, as stated by William Irvine in Madam I'm Adam and Other Palindromes (1987): ""The well-known ABLE WAS I, ERE I SAW ELBA, for example, is conveniently attributed to Napoleon, whose knowledge of English wordplay was certainly questionable, at best. "" There is no mention of such a palindrome in O'Meara's own work, Napoleon in Exile: or, A Voice from St. Helena (1822)."
3,4567,"Through all the years that I have been in business I have never yet found our business bad as a result of any outside force. It has always been due to some defect in our own company, and whenever we located and repaired that defect our business became good again - regardless of what anyone else might be doing. And it will always be found that this country has nationally bad business when business men are drifting, and that business is good when men take hold of their own affairs, put leadership into them, and push forward in spite of obstacles. Only disaster can result when the fundamental principles of business are disregarded and what looks like the easiest way is taken. These fundamentals, as I see them, are: :(1) To make an ever increasingly large quantity of goods of the best possible quality, to make them in the best and most economical fashion, and to force them out onto the market. :(2) To strive always for higher quality and lower prices as well as lower costs. :(3) To raise wages gradually but continuously B and never to cut them. :(4) To get the goods to the consumer in the most economical manner so that the benefits of low cost production may reach him. :These fundamentals are all summed up in the single word 'service'. . The service starts with discovering what people need and then supplying that need according to the principles that have just been given. :* Henry Ford in: Justus George Frederick (1930), A Philosophy of Production: A Symposium, p. 32; as cited in: Morgen Witzel (2003) Fifty Key Figures in Management. p. 196","Through all the years that I have been in business I have never yet found our business bad as a result of any outside force. It has always been due to some defect in our own company, and whenever we located and repaired that defect our business became good again - regardless of what anyone else might be doing. 
And it will always be found that this country has nationally bad business when business men are drifting, and that business is good when men take hold of their own affairs, put leadership into them, and push forward in spite of obstacles. Only disaster can result when the fundamental principles of business are disregarded and what looks like the easiest way is taken. These fundamentals, as I see them, are: :(1) To make an ever increasingly large quantity of goods of the best possible quality, to make them in the best and most economical fashion, and to force them out onto the market. :(2) To strive always for higher quality and lower prices as well as lower costs. :(3) To raise wages gradually but continuously B and never to cut them. :(4) To get the goods to the consumer in the most economical manner so that the benefits of low cost production may reach him. :These fundamentals are all summed up in the single word 'service'. . The service starts with discovering what people need and then supplying that need according to the principles that have just been given.","Henry Ford in: Justus George Frederick (1930), A Philosophy of Production: A Symposium, p. 32; as cited in: Morgen Witzel (2003) Fifty Key Figures in Management. p. 196"
4,7029,"""To the German commander. :: Nuts! : From the American commander. "": * His famous reply to the German demand for surrender of the surrounded US 101st Airborne Division at Bastogne in the Battle of the Bulge (22 December 1944), as quoted in Bastogne: The Story of the First Eight Days In Which the 101st Airborne Division Was Closed Within the Ring of German Forces (1946) by Colonel S. L. A. Marshal, Ch. 14; delivering the message Colonel Joseph H. Harper was asked ""What does that mean? . . Is this affirmative or negative? "" and replied ""Definitely not affirmative. """,To the German commander. :: Nuts! : From the American commander.,"His famous reply to the German demand for surrender of the surrounded US 101st Airborne Division at Bastogne in the Battle of the Bulge (22 December 1944), as quoted in Bastogne: The Story of the First Eight Days In Which the 101st Airborne Division Was Closed Within the Ring of German Forces (1946) by Colonel S. L. A. Marshal, Ch. 14; delivering the message Colonel Joseph H. Harper was asked ""What does that mean? . . Is this affirmative or negative? "" and replied ""Definitely not affirmative."
5,7399,"One of the best things I read was an 1889 essay by Andrew Carnegie called The Gospel of Wealth. It makes the case that the wealthy have a responsibility to return their resources to society, a radical idea at the time that laid the groundwork for philanthropy as we know it today. :In the essay's most famous line, Carnegie argues that ""the man who dies thus rich dies disgraced. "" I have spent a lot of time thinking about that quote lately. People will say a lot of things about me when I die, but I am determined that ""he died rich"" will not be one of them. There are too many urgent problems to solve for me to hold onto resources that could be used to help people. :That is why I have decided to give my money back to society much faster than I had originally planned. I will give away virtually all my wealth through the Gates Foundation over the next 20 years to the cause of saving and improving lives around the world. :*As quoted in Gates Notes (May 8, 2025)","One of the best things I read was an 1889 essay by Andrew Carnegie called The Gospel of Wealth. It makes the case that the wealthy have a responsibility to return their resources to society, a radical idea at the time that laid the groundwork for philanthropy as we know it today. :In the essay's most famous line, Carnegie argues that ""the man who dies thus rich dies disgraced. "" I have spent a lot of time thinking about that quote lately. People will say a lot of things about me when I die, but I am determined that ""he died rich"" will not be one of them. There are too many urgent problems to solve for me to hold onto resources that could be used to help people. :That is why I have decided to give my money back to society much faster than I had originally planned. I will give away virtually all my wealth through the Gates Foundation over the next 20 years to the cause of saving and improving lives around the world.","As quoted in Gates Notes (May 8, 2025)"
6,7577,"In German, a young lady has no sex, while a turnip has. :*Appendix D, The Awful German Language","In German, a young lady has no sex, while a turnip has.","Appendix D, The Awful German Language"
7,7625,"This last summer, when I was on my way back to Vienna from the Appetite-Cure in the mountains, I fell over a cliff in the twilight, and broke some arms and legs and one thing or another, and by good luck was found by some peasants who had lost an ass, and they carried me to the nearest habitation, which was one of those large, low, thatch-roofed farm-houses, with apartments in the garret for the family, and a cunning little porch under the deep gable decorated with boxes of bright colored flowers and cats; on the ground floor a large and light sitting-room, separated from the milch-cattle apartment by a partition; and in the front yard rose stately and fine the wealth and pride of the house, the manure-pile. That sentence is Germanic, and shows that I am acquiring that sort of mastery of the art and spirit of the language which enables a man to travel all day in one sentence without changing cars. :*Book I, Ch. 1","This last summer, when I was on my way back to Vienna from the Appetite-Cure in the mountains, I fell over a cliff in the twilight, and broke some arms and legs and one thing or another, and by good luck was found by some peasants who had lost an ass, and they carried me to the nearest habitation, which was one of those large, low, thatch-roofed farm-houses, with apartments in the garret for the family, and a cunning little porch under the deep gable decorated with boxes of bright colored flowers and cats; on the ground floor a large and light sitting-room, separated from the milch-cattle apartment by a partition; and in the front yard rose stately and fine the wealth and pride of the house, the manure-pile. That sentence is Germanic, and shows that I am acquiring that sort of mastery of the art and spirit of the language which enables a man to travel all day in one sentence without changing cars.","Book I, Ch. 1"
8,7626,"No one doubts—certainly not I—that the mind exercises a powerful influence over the body. From the beginning of time, the sorcerer, the interpreter of dreams, the fortune-teller, the charlatan, the quack, the wild medicine-man, the educated physician, the mesmerist, and the hypnotist have made use of the client's imagination to help them in their work. They have all recognized the potency and availability of that force. Physicians cure many patients with a bread pill; they know that where the disease is only a fancy, the patient's confidence in the doctor will make the bread pill effective. :*Book I, Ch. 4","No one doubts—certainly not I—that the mind exercises a powerful influence over the body. From the beginning of time, the sorcerer, the interpreter of dreams, the fortune-teller, the charlatan, the quack, the wild medicine-man, the educated physician, the mesmerist, and the hypnotist have made use of the client's imagination to help them in their work. They have all recognized the potency and availability of that force. Physicians cure many patients with a bread pill; they know that where the disease is only a fancy, the patient's confidence in the doctor will make the bread pill effective.","Book I, Ch. 4"
9,7627,"When I was a boy a farmer's wife who lived five miles from our village had great fame as a faith-doctor—that was what she called herself. Sufferers came to her from all around, and she laid her hand upon them and said, ""Have faith—it is all that is necessary, "" and they went away well of their ailments. She was not a religious woman, and pretended to no occult powers. She said that the patient's faith in her did the work. Several times I saw her make immediate cures of severe toothaches. My mother was the patient. In Austria there is a peasant who drives a great trade in this sort of industry, and has both the high and the low for patients. He gets into prison every now and then for practising without a diploma, but his business is as brisk as ever when he gets out, for his work is unquestionably successful and keeps his reputation high. In Bavaria there is a man who performed so many great cures that he had to retire from his profession of stage-carpentering in order to meet the demand of his constantly increasing body of customers. He goes on from year to year doing his miracles, and has become very rich. He pretends to no religious helps, no supernatural aids, but thinks there is something in his make-up which inspires the confidence of his patients, and that it is this confidence which does the work, and not some mysterious power issuing from himself. :*Ch. 4","When I was a boy a farmer's wife who lived five miles from our village had great fame as a faith-doctor—that was what she called herself. Sufferers came to her from all around, and she laid her hand upon them and said, ""Have faith—it is all that is necessary, "" and they went away well of their ailments. She was not a religious woman, and pretended to no occult powers. She said that the patient's faith in her did the work. Several times I saw her make immediate cures of severe toothaches. My mother was the patient. 
In Austria there is a peasant who drives a great trade in this sort of industry, and has both the high and the low for patients. He gets into prison every now and then for practising without a diploma, but his business is as brisk as ever when he gets out, for his work is unquestionably successful and keeps his reputation high. In Bavaria there is a man who performed so many great cures that he had to retire from his profession of stage-carpentering in order to meet the demand of his constantly increasing body of customers. He goes on from year to year doing his miracles, and has become very rich. He pretends to no religious helps, no supernatural aids, but thinks there is something in his make-up which inspires the confidence of his patients, and that it is this confidence which does the work, and not some mysterious power issuing from himself.",Ch. 4


Rows with candidate source found: 46


In [103]:
for row in preview:
    idx = row["index"]
    df_new.at[idx, "quote"]  = row["new_quote"]
    df_new.at[idx, "source"] = row["new_source_candidate"]

# (optional) tidy any quotes that became empty/punct-only after trimming
mask_bad_quote = df_new["quote"].astype(str).str.fullmatch(ONLY_PUNCT_RE) | df_new["quote"].isna()
# decide whether to drop or leave them; example drops them:
df_new = df_new[~mask_bad_quote].reset_index(drop=True)

In [104]:
mask_bad_quote

0        False
1        False
2        False
3        False
4        False
         ...  
29670    False
29671    False
29672    False
29673    False
29674    False
Name: quote, Length: 29675, dtype: bool
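
The punctuation-only check used above can be tried on a small toy Series to see which values it flags. This is a minimal sketch: `ONLY_PUNCT_RE` is assumed to be the pattern shown in the commented-out line of a later cell, and the `toy` data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed definition, copied from the commented-out line later in the notebook.
ONLY_PUNCT_RE = r'^[^\w\u4e00-\u9fff]+$'

# Hypothetical sample values: a real quote, punctuation-only strings,
# an empty string, a missing value, and a CJK character.
toy = pd.Series(["E=mc²", " ... ", "?!", "", np.nan, "好"])

# Same expression as in the cell above: punctuation-only OR missing.
is_bad = toy.astype(str).str.fullmatch(ONLY_PUNCT_RE) | toy.isna()
print(is_bad.tolist())
```

Note that the empty string is not flagged, because the pattern requires at least one character. If empty quotes can occur at this point, an extra `.eq("")` condition would be needed to catch them.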

Next, let's look at the rows whose source is not NaN but is shorter than 5 characters.

In [105]:
# rows with non-NaN source and length < 5 (after stripping whitespace)
mask_src_short = df_new["source"].notna() & (df_new["source"].astype(str).str.strip().str.len() < 5)

short_sources = df_new[mask_src_short]
print(f"Rows found: {len(short_sources)}")
display(short_sources)

Rows found: 248


Unnamed: 0,quote,source,author,heading_context
355,"School failed me, and I failed the school. It bored me. The teachers behaved like Feldwebel (sergeants). I wanted to learn what I wanted to know, but they wanted me to learn for the exam. What I hated most was the competitive system there, and especially sports. Because of this, I wasn't worth anything, and several times they suggested I leave. This was a Catholic School in Munich. I felt that my thirst for knowledge was being strangled by my teachers; grades were their only measurement. How can a teacher understand youth with such a system? . . from the age of twelve I began to suspect authority and distrust teachers. I learned mostly at home, first from my uncle and then from a student who came to eat with us once a week. He would give me books on physics and astronomy. The more I read, the more puzzled I was by the order of the universe and the disorder of the human mind, by the scientists who didn't agree on the how, the when, or the why of creation. Then one day this student brought me Kant's Critique of Pure Reason. Reading Kant, I began to suspect everything I was taught. I no longer believed in the known God of the Bible, but rather in the mysterious God expressed in nature.",p. 8,Albert Einstein,Einstein and the Poet (1983)
1636,"The conception of the necessary unit of all that is resolves itself into the poverty of the imagination, and a freer logic emancipates us from the straitwaistcoated benevolent institution which idealism palms off as the totality of being.",p. 9,Bertrand Russell,Our Knowledge of the External World (1914)
1637,"The true function of logic. . as applied to matters of experience. . is analytic rather than constructive; taken a priori, it shows the possibility of hitherto unsuspected alternatives more often than the impossibility of alternatives which seemed prima facie possible. Thus, while it liberates imagination as to what the world may be, it refuses to legislate as to what the world is.",p. 8,Bertrand Russell,Our Knowledge of the External World (1914)
1985,"When I was a child the atmosphere in the house was one of puritan piety and austerity. There were family prayers at eight o'clock every morning. Although there were eight servants, food was always of Spartan simplicity, and even what there was, if it was at all nice, was considered too good for children. For instance, if there was apple tart and rice pudding, I was only allowed the rice pudding. Cold baths all the year round were insisted upon, and I had to practice the piano from seven-thirty to eight every morning although the fires were not yet lit. My grandmother never allowed herself to sit in an armchair until the evening. Alcohol and tobacco were viewed with disfavor although stern convention compelled them to serve a little wine to guests. Only virtue was prized, virtue at the expense of intellect, health, happiness, and every mundane good.",p. 9,Bertrand Russell,Portraits from Memory and Other Essays (1956)
1986,"I was a solitary, shy, priggish youth. I had no experience of the social pleasures of boyhood and did not miss them. But I liked mathematics, and mathematics was suspect because it has no ethical content. I came also to disagree with the theological opinions of my family, and as I grew up I became increasingly interested in philosophy, of which they profoundly disapproved. Every time the subject came up they repeated with unfailing regularity, 'What is mind? No matter. What is matter? Never mind. ' After some fifty or sixty repetitions, this remark ceased to amuse me.",p. 9,Bertrand Russell,Portraits from Memory and Other Essays (1956)
...,...,...,...,...
27269,"Our language can be seen as an ancient city: a maze of little streets and squares, of old and new houses, and of houses with additions from various periods; and this surrounded by a multitude of new boroughs with straight regular streets and uniform houses.",§ 18,Ludwig Wittgenstein,Philosophical Investigations (1953)
27271,"Don't say: ""They must have something in common, or they would not be called 'games'"" but look and see whether there is anything common to all. For if you look at them, you won't see something that is common to all, but similarities, affinities, and a whole series of them at that. To repeat: don't think, but look!",§ 66,Ludwig Wittgenstein,Philosophical Investigations (1953)
28266,"Titan! to whom immortal eyes The sufferings of mortality Seen in their sad reality, Were not as things that gods despise; What was thy pity's recompense? A silent suffering, and intense; The rock, the vulture, and the chain, All that the proud can feel of pain, The agony they do not show, The suffocating sense of woe, Which speaks but in its loneliness, And then is jealous lest the sky Should have a listener, nor will sigh Until its voice is echoless.",I.,Lord Byron,Prometheus (1816)
28267,"Titan! to thee the strife was given Between the suffering and the will, Which torture where they cannot kill; And the inexorable Heaven, And the deaf tyranny of Fate, The ruling principle of Hate, Which for its pleasure doth create The things it may annihilate, Refused thee even the boon to die: The wretched gift eternity Was thine — and thou hast borne it well. All that the Thunderer wrung from thee Was but the menace which flung back On him the torments of thy rack; The fate thou didst so well foresee, But would not to appease him tell; And in thy Silence was his Sentence, And in his Soul a vain repentance, And evil dread so ill dissembled, That in his hand the lightnings trembled.",II.,Lord Byron,Prometheus (1816)


Most of these short sources are bare locators (page numbers, chapters, Roman numerals such as I and II). Let's check whether any rows have such a source and also a NaN heading context. We will drop those too.

In [106]:
print(df_new.isna().sum())

quote                 0
source             6770
author                0
heading_context    9115
dtype: int64


In [107]:

# ONLY_PUNCT_RE   = r'^[^\w\u4e00-\u9fff]+$'              # only punctuation/space
CONT_SPLIT_RE   = re.compile(r'\s*[:：]\s*\*+\s*')       # :*, ：*, : **, etc.

def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = re.sub(r"\s+", " ", s).strip()
    return s.strip(" \t\n\r\"'，。；;:,-")

def pick_candidate(chunks):
    """Pick the longest non-empty, non-punct-only chunk."""
    cleaned = [clean_text(c) for c in chunks if isinstance(c, str)]
    cleaned = [c for c in cleaned if c and not re.fullmatch(ONLY_PUNCT_RE, c)]
    return max(cleaned, key=len) if cleaned else ""

# literals that are never meaningful as a full source *by themselves*
LITERAL_BAD_SOURCES = {
    ".", ",", "'.", "<",
    "Couplet", "Epilogue", "proverb",
    "Lennox", "Prospero", "Minotaur", "Cicero", "Björk", "Daniel",
}

# regexes for common “not meaningful” patterns (case-insensitive)
BAD_SOURCE_PATTERNS = [
    re.compile(r"^\s*scene\s*[ivxlcdm]+\s*$", re.I),           # Scene iv, Scene III, etc.
    re.compile(r"^\s*chapter\s*\d+\s*$", re.I),                # Chapter 1, Chapter 12
    re.compile(r"^\s*,?\s*p\.\s*(?:\d+|[ivxlcdm]+)(?:-\d+)?\s*$", re.I),  # p. 11, , p. xcv, p. 50-51
    re.compile(r"^\s*ibid\.?\s*$", re.I),                      # Ibid.
    re.compile(r"^\s*cited in\b.*$", re.I),                    # Cited in ...
    re.compile(r"^\s*\{\{cite.*$", re.I),                      # {{cite ...}}
    re.compile(ONLY_PUNCT_RE),                                 # only punctuation
    re.compile(r"^\s*[ivxlcdm]+\s*$", re.I),                   # bare Roman numeral: I, II, III...
    re.compile(r"^\s*\(2006\)\.\s*$", re.I),                   # (2006).
]

def source_is_meaningless(x) -> bool:
    """True if source is NaN OR one of the known 'meaningless' forms above."""
    if pd.isna(x):
        return True
    t = str(x).strip()
    if not t:
        return True
    # case-insensitive literal match (this also covers the exact-match case)
    if t.lower() in {s.lower() for s in LITERAL_BAD_SOURCES}:
        return True
    for pat in BAD_SOURCE_PATTERNS:
        if pat.fullmatch(t):
            return True
    return False

# --- build masks -----------------------------------------------------------

# "no heading context"
mask_no_context = df_new["heading_context"].isna()

# "no author" (strict NaN). If you also want to treat 'unknown' as missing, add .str.lower().eq('unknown')
mask_no_author  = df_new["author"].isna()

# source meaningless (includes NaN)
mask_bad_source = df_new["source"].apply(source_is_meaningless)

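# rows to handle: all three conditions must hold at once.
# Note: "author" has no NaN values (see the isna() counts above),
# so mask_no_author is all False and this conjunction selects no rows.
mask_targets = mask_no_context & mask_no_author & mask_bad_source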
targets = df_new[mask_targets]
print(f"Candidates to handle (bad source + no context/author): {len(targets)}")
# display(targets.head(20))  # uncomment to preview


Candidates to handle (bad source + no context/author): 0
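
The short-source preview above contains bare locators such as `§ 18`, `I.` and `II.`, and earlier rows show `Ch. 4`. As far as one can tell from the patterns in the cell above, these forms would not be matched: the chapter pattern needs the full word, and the bare Roman-numeral pattern does not allow a trailing period. The sketch below shows one way such locators could be recognised. `EXTRA_LOCATOR_PATTERNS` and `is_bare_locator` are hypothetical names and are not part of the original notebook.

```python
import re

# Hypothetical extra patterns (an assumption, not the notebook's own list)
# for bare locators that carry no source information by themselves.
EXTRA_LOCATOR_PATTERNS = [
    re.compile(r"§\s*\d+"),                                   # § 18, §66
    re.compile(r"[ivxlcdm]+\.?", re.I),                       # I, II., iv
    re.compile(r"ch(?:apter|\.)?\s*\d+", re.I),               # Ch. 4, Chapter 12
    re.compile(r"p{1,2}\.\s*\d+(?:\s*[-–]\s*\d+)?", re.I),    # p. 8, pp. 50-51
]

def is_bare_locator(text: str) -> bool:
    """True if the whole (stripped) string is just a page/section/chapter locator."""
    t = text.strip()
    return any(p.fullmatch(t) for p in EXTRA_LOCATOR_PATTERNS)

samples = ["p. 8", "§ 18", "I.", "II.", "Ch. 4", "Prometheus (1816)"]
flags = [is_bare_locator(s) for s in samples]
for s, f in zip(samples, flags):
    print(f"{s!r:22} -> {f}")
```

The Roman-numeral pattern is deliberately loose and would also match short words made only of those letters (for example "mix"), so it is worth previewing the matched rows before dropping anything.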


In [114]:

# 1) Only a complete {{cite book ...}} template and nothing else
cite_book_only = re.compile(r'^\s*\{\{\s*cite\s*book\b.*?\}\}\s*$', re.IGNORECASE | re.DOTALL)

# 2) Only an unclosed/truncated {{cite book ...}} tail
cite_book_truncated_only = re.compile(r'^\s*\{\{\s*cite\s*boo?k?\s*$', re.IGNORECASE)

mask_cite_only = df_new["source"].fillna("").str.match(cite_book_only)
mask_cite_trunc = df_new["source"].fillna("").str.match(cite_book_truncated_only)

rows_cite_only = df_new[mask_cite_only | mask_cite_trunc]

print(f"Rows where source is ONLY a cite book template (closed or truncated): {len(rows_cite_only)}")
display(rows_cite_only[["quote","author","source","heading_context"]])


Rows where source is ONLY a cite book template (closed or truncated): 0


Unnamed: 0,quote,author,source,heading_context


In [113]:
# setting them to NaN
df_new.loc[mask_cite_only | mask_cite_trunc, "source"] = np.nan
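
To see what the two cite-book patterns do and do not match, they can be run against a few hand-written strings. This is only an illustration: the regexes are copied from the cell above, while the sample strings (titles, years) are invented.

```python
import re

# Patterns copied from the notebook cell above.
cite_book_only = re.compile(r'^\s*\{\{\s*cite\s*book\b.*?\}\}\s*$', re.IGNORECASE | re.DOTALL)
cite_book_truncated_only = re.compile(r'^\s*\{\{\s*cite\s*boo?k?\s*$', re.IGNORECASE)

# Hypothetical sample sources.
samples = [
    "{{cite book |title=Example |year=1999}}",   # complete template only
    "{{cite book",                                # truncated tail only
    "{{cite bo",                                  # truncated even earlier
    "See {{cite book |title=Example}}",           # template plus other text
    "Example Book (1999), p. 4",                  # ordinary source
]

results = [
    bool(cite_book_only.match(s) or cite_book_truncated_only.match(s))
    for s in samples
]
for s, hit in zip(samples, results):
    print(f"{hit!s:5}  {s}")
```

Only the first three samples are treated as template-only sources. A source that has readable text before the template is left alone.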


In [117]:
df_new.head(5)

Unnamed: 0,quote,source,author,heading_context
0,Un homme heureux est trop content du présent pour trop se soucier de l'avenir.,"A happy man is too satisfied with the present to dwell too much on the future. From ""Mes Projets d'Avenir"", a French essay written at age 18 for a school exam (18 September 1896). The Collected Papers of Albert Einstein Vol. 1 (1987) Doc. 22.",Albert Einstein,1890s
1,"Autoritätsdusel ist der größte Feind der Wahrheit. :: Another translation: Authority gone to one's head is the greatest enemy of truth. (Collected Papers, Volume 1, 1987): * Letter to Jost Winteler (July 8th, 1901), quoted in The Private Lives of Albert Einstein by Roger Highfields and Paul Carter (1993), p. 79. Einstein had been annoyed that Paul Drude, editor of Annalen der Physik, had dismissed some criticisms Einstein made of Drude's electron theory of metals.",Blind obedience to authority is the greatest enemy of truth.,Albert Einstein,1900s
2,"Lieber Habicht! / Es herrscht ein weihevolles Stillschweigen zwischen uns, so daß es mir fast wie eine sündige Entweihung vorkommt, wenn ich es jetzt durch ein wenig bedeutsames Gepappel unterbreche. . / Was machen Sie denn, Sie eingefrorener Walfisch, Sie getrocknetes, eingebüchstes Stück Seele. . ?","Dear Habicht, / Such a solemn air of silence has descended between us that I almost feel as if I am committing a sacrilege when I break it now with some inconsequential babble. . / What are you up to, you frozen whale, you smoked, dried, canned piece of soul. . ? Opening of a letter to his friend Conrad Habicht in which he describes his four revolutionary Annus Mirabilis papers (18 or 25 May 1905) Doc. 27",Albert Einstein,1900s
3,E=mc²,"The equation originally expressed the equivalence of mass and energym = L/c², which easily translates into the far more well-known E = mc² in Does the Inertia of a Body Depend Upon Its Energy Content? published in the Annalen der Physik (27 September 1905): ""If a body gives off the energy L in the form of radiation, its mass diminishes by L/c². "" In a later statement explaining the ideas expressed by this equation, Einstein summarized: ""It followed from the special theory of relativity that mass and energy are both but different manifestations of the same thing — a somewhat unfamiliar conception for the average mind. Furthermore, the equation E = mc², in which energy is put equal to mass, multiplied by the square of the velocity of light, showed that very small amounts of mass may be converted into a very large amount of energy and vice versa. The mass and energy were equivalent, according to the formula mentioned before. This was demonstrated by Cockcroft and Walton in 1932, experimentally. "" Atomic Physics (1948) by the J. Arthur Rank Organisation, Ltd. (Voice of A. Einstein. )",Albert Einstein,1900s
4,The mass of a body is a measure of its energy content.,"Ist die Trägheit eines Körpers von seinem Energieinhalt abhängig? (""Does the inertia of a body depend upon its energy content? "") Annalen der Physik 18, 639-641 (1905). Quoted in Concepts of Mass in Classical and Modern Physics by Max Jammer (1961), p. 177",Albert Einstein,1900s


In [118]:
df_new.to_csv("Preprocessed_quotes_dataframe.csv", index=False)
print("Saved:", "Preprocessed_quotes_dataframe.csv")

Saved: Preprocessed_quotes_dataframe.csv


In [120]:
df_2 = pd.read_csv("extracted_quotes_about.csv", index_col=False)

In [121]:
# 1) Build a (quote, source, heading_context) -> author lookup from df_2
df2_auth = (
    df_2
    .dropna(subset=["author"])  # only rows that actually have an author
    .drop_duplicates(subset=["quote", "source", "heading_context"], keep="first")
    [["quote", "source", "heading_context", "author"]]
    .rename(columns={"author": "author_new"})
)

# 2) Join onto df_new
merged = df_new.merge(
    df2_auth,
    on=["quote", "source", "heading_context"],
    how="left",
)

# 3) Preview planned changes
to_update = merged["author_new"].notna() & (merged["author"].fillna("") != merged["author_new"].fillna(""))
print(f"Rows to update: {to_update.sum()}")
display(merged.loc[to_update, ["quote", "source", "heading_context", "author", "author_new"]].head(20))


Rows to update: 5288


Unnamed: 0,quote,source,heading_context,author,author_new
464,"These days it is common knowledge that short waves are more powerful than long ones, as the very short ones, known as x-rays, damage living tissues. It took half-a-century to learn this fact: it was one of the great discoveries of young Albert Einstein of 1905. When he announced it leading researchers found it most incredible. .","Joseph Agassi, Radiation Theory and the Quantum Revolution (1993)",,unknown,About Albert Einstein
465,". . do not be impressed by the imprint of a famous publishing house or the volumes of an author's publications. Bear in mind that Einstein needed only seventeen pages for his contribution which revolutionized physics, while there are graphomanics in asylums who use up mounds of paper every day.","Stanislav Andreski, The Social Sciences as Sorcery (1972, London: Deutsch), p 86",,unknown,About Albert Einstein
466,"Paula Gunn Allen's description of the tribal culture is helpful in understanding this concept of energy dispersal: ""The closest analogy in Western thought is the Einsteinian understanding of matter as a special state or condition of energy. Yet even this concept falls short of the Native American understanding, for Einsteinian energy is essentially stupid, while energy in the Indian view is intelligence manifesting yet another way. ""","Bettina Aptheker Tapestries of Life: Women's Work, Women's Consciousness, and the Meaning of Daily Experience (1989)",,unknown,About Albert Einstein
467,The astonishing thing about Einstein's equations is that they appear to have come out of nothing.,"Ernest Barnes, as quoted by Gerald James Whitrow, The Structure of the Universe: An Introduction to Cosmology (1949)",,unknown,About Albert Einstein
468,"[During 1940s], Einstein was pursuing what he called his 'violon d'Ingres'—his uniﬁed ﬁeld theory. . The so-called strange particles were just being discovered, and the quantum theory was proving ever more powerful. Einstein simply was not much interested. His position was that it was useless to try to understand this new physics until the electron was understood. We now believe that understanding the electron is such an intimate part of the new physics that the electron cannot be understood by itself. But Besso took all his old friend's attempts extremely seriously, and Einstein gave him detailed explanations of his various formal manipulations. It was a dialogue that somehow reminds me of the plays of Samuel Beckett.","Jeremy Bernstein, Quantum Profiles, pp. 157-158.",,unknown,About Albert Einstein
469,"I was particularly won over by his sweet disposition, by his general kindness, by his simplicity, and by his friendliness. Occasionally, gaiety would gain the upper hand and he would strike a more personal note and even disclose some detail of his day-to-day life. Then again, reverting to his characteristic mood of reflection and meditation, he would launch into a profound and original discussion of a variety of scientific and other problems. I shall always remember the enchantment of all those meetings, from which I carried away an indelible impression of Einstein's great human qualities.","Louis de Broglie, New Perspectives in Physics, p. 182",,unknown,About Albert Einstein
470,"It is almost impertinent to talk of the ascent of man in the presence of two men, Newton and Einstein, who stride like gods. Of the two, Newton is the Old Testament god; it is Einstein who is the New Testament figure. He was full of humanity, pity, a sense of enormous sympathy. His vision of nature herself was that of a human being in the presence of something god-like, and that is what he always said about nature. He was fond of talking about God: 'God does not play at dice', 'God is not malicious'. Finally Niels Bohr one day said to him, 'Stop telling God what to do'. But that is not quite fair. Einstein was a man who could ask immensely simple questions. And what his life showed, and his work, is that when the answers are simple too, then you hear God thinking.","Jacob Bronowski, The Ascent of Man (1974), Ch. 7: The Majestic Clockwork",,unknown,About Albert Einstein
471,Like many other great scientists he does not fit the boxes in which popular polemicists like to pigeonhole him. . . It is clear for example that he had respect for the religious values enshrined within Judaic and Christian traditions. . but what he understood by religion was something far more subtle than what is usually meant by the word in popular discussion.,"John Brooke, as quoted in ""Childish superstition: Einstein's letter makes view of religion relatively clear"" in The Guardian (13 May 2008)",,unknown,About Albert Einstein
472,"Some people have reported that Einstein was quite a good musician, but others weren't so enthusiastic. A professional violinist claimed he ""fiddled like a lumberjack""; a famous pianist playing with him demanded, ""For heaven's sake Albert, can't you count? ""; and a music critic in Berlin, thinking Einstein was famous for his violin playing rather than physics, judged that ""Einstein's playing is excellent, but he does not deserve world fame; there are many others just as good. ""","Alice Calaprice & Trevor Lipscombe, Albert Einstein: A Biography (2005)",,unknown,About Albert Einstein
473,"A niece of Einstein's, in India during the 1960s, paid a special visit to the headquarters of the Theosophical Society at Adyar. She explained that she knew nothing of theosophy or the society, but had to see the place because her uncle always had a copy of Madame Blavatsky's Secret Doctrine on his desk. The individual to whom the niece spoke was Eunice Layton, a world-traveled theosophical lecturer who happened to be at the reception desk when she arrived.","Sylvia Cranston HPB - The Extraordinary Life and Influence of Helena Blavatsky, Founder of the Modern Theosophical Movement (New York: Putnam, 1994), p. 557-558.",,unknown,About Albert Einstein


In [122]:
merged.loc[to_update, "author"] = merged.loc[to_update, "author_new"]
df_new_updated = merged.drop(columns=["author_new"]).reset_index(drop=True)

In [123]:
df_new_updated.shape

(29675, 4)

Let's also trim leading and trailing whitespace and punctuation from the quotes before saving.

In [125]:
EDGE_CHARS = ' \t\r\n.,;:!?\'"“”‘’«»‹›()[]{}<>|/\\·•…`~--–—-'

def trim_edges(series: pd.Series) -> pd.Series:
    s = series.astype("object")               # keeps NaNs
    mask = s.notna()
    s.loc[mask] = (
        s.loc[mask].astype(str)
        .str.replace(r"\s+", " ", regex=True) 
        .str.strip(EDGE_CHARS)                
    )
    return s

df_new_updated["quote"]  = trim_edges(df_new_updated["quote"])
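
A quick sanity check of the trimming idea on a toy Series. This is only a sketch: `DEMO_EDGE_CHARS` and `trim_edges_demo` are hypothetical, shortened stand-ins for `EDGE_CHARS` and `trim_edges`, and the sample strings are made up. It relies on the `.str` accessor skipping missing values, so no explicit mask is needed here.

```python
import pandas as pd

# Hypothetical, shortened subset of EDGE_CHARS, only for this demo.
DEMO_EDGE_CHARS = ' \t\r\n.,;:!?\'"“”()[]…-–—'

def trim_edges_demo(series: pd.Series) -> pd.Series:
    # .str methods leave missing values missing, so NaN/None survive untouched.
    return (
        series.str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
        .str.strip(DEMO_EDGE_CHARS)                  # trim only the two ends
    )

sample = pd.Series(['  "Hello,   world!"  ', "…to be, or not to be—", None])
cleaned = trim_edges_demo(sample)
print(cleaned.tolist())
```

Punctuation inside the quote (the comma in both samples) is kept, because `str.strip` only removes characters from the ends of the string.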



In [129]:
def parse_author(author):
    if pd.isna(author):
        return "canonical", "unknown"

    a = author.strip()
    lower = a.lower()

    if lower.startswith("about "):
        return "about", a[6:].strip()
    elif lower.startswith("misattributed to "):
        return "misattributed", a[17:].strip()
    elif lower.startswith("disputed with "):
        return "disputed", a[14:].strip()  # len("disputed with ") == 14
    elif lower == "unknown":
        return "canonical", "unknown"
    else:
        return "canonical", a  # plain confirmed name

df_new_updated[["status", "target_name"]] = df_new_updated["author"].apply(
    lambda x: pd.Series(parse_author(x))
)

# now save with spread columns
df_new_updated.to_csv("quotes_spread.csv", index=False)
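
The prefix parsing can also be written table-driven, so the slice offset comes from `len(prefix)` instead of a hand-counted number. This is a sketch of an alternative, not the code that produced `quotes_spread.csv`; `PREFIXES` and `parse_author_table` are hypothetical names, and the sample authors are made up.

```python
import pandas as pd

# Prefix -> status. The offset is derived from len(prefix), never hard-coded.
PREFIXES = {
    "about ": "about",
    "misattributed to ": "misattributed",
    "disputed with ": "disputed",
}

def parse_author_table(author):
    """Return (status, target_name) for one raw author string."""
    if pd.isna(author):
        return "canonical", "unknown"
    name = author.strip()
    lower = name.lower()
    for prefix, status in PREFIXES.items():
        if lower.startswith(prefix):
            return status, name[len(prefix):].strip()
    # No prefix matched: a plain name, or the literal placeholder "unknown".
    return "canonical", ("unknown" if lower == "unknown" else name)

authors = ["About Albert Einstein", "Misattributed to Mark Twain",
           "Disputed with Voltaire", "unknown", None, "Chinese proverbs"]
parsed = pd.DataFrame(
    [parse_author_table(a) for a in authors],
    columns=["status", "target_name"],
)
print(parsed)
```

Building the frame from a list of tuples also avoids creating one `pd.Series` per row, which tends to be slow on tens of thousands of rows.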

In [134]:
# Fill NaN before casting: astype(str) would otherwise turn NaN into the string "nan"
df_new_updated["quote"] = df_new_updated["quote"].fillna("").astype(str).str.strip()
df_new_updated["source"] = df_new_updated["source"].fillna("").astype(str).str.strip()

# Lengths
q_len = df_new_updated["quote"].str.len()
s_len = df_new_updated["source"].str.len()

# 1) Drop long quotes
removed_quotes = (q_len > 1000).sum()
df_new = df_new_updated[q_len <= 1000].copy()

# 2) Keep the row, but cap source at 1000 chars (or 300 if you prefer)
df_new["source"] = df_new["source"].str.slice(0, 1000)

print(f"Removed {removed_quotes} quotes >1000 chars; kept {len(df_new)} rows.")

Removed 1813 quotes >1000 chars; kept 27862 rows.
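
Before committing to a cutoff, it may help to see how many rows each candidate threshold would drop. The helper below is a sketch: `length_report` and its `cutoffs` parameter are hypothetical, and the demo Series is made up rather than taken from the dataset.

```python
import pandas as pd

def length_report(texts: pd.Series, cutoffs=(300, 500, 1000)) -> pd.DataFrame:
    """Count rows whose text is longer than each candidate cutoff."""
    # Fill missing values first so they count as length 0, not as "nan".
    lengths = texts.fillna("").astype(str).str.len()
    rows = [
        {
            "cutoff": c,
            "rows_over": int((lengths > c).sum()),
            "share_over": float((lengths > c).mean()),
        }
        for c in cutoffs
    ]
    return pd.DataFrame(rows)

demo = pd.Series(["a" * 10, "b" * 400, "c" * 1200, None])
report = length_report(demo)
print(report)
```

On the real data this would be called as `length_report(df_new_updated["quote"])` and again on the `source` column.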


In [135]:
df_new.shape

(27862, 6)

In [137]:
df_new.to_csv("quotes_database_final.csv", index=False)

In [126]:
df_new_updated.to_csv("Preprocessed_quotes_dataframe_updated_1.csv", index=False)
print("Saved:", "Preprocessed_quotes_dataframe_updated_1.csv")

Saved: Preprocessed_quotes_dataframe_updated_1.csv


In [138]:
pip install fastapi

Collecting fastapi
Note: you may need to restart the kernel to use updated packages.

  Downloading fastapi-0.116.1-py3-none-any.whl.metadata (28 kB)
Collecting starlette<0.48.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.47.2-py3-none-any.whl.metadata (6.2 kB)
Downloading fastapi-0.116.1-py3-none-any.whl (95 kB)
Downloading starlette-0.47.2-py3-none-any.whl (72 kB)
Installing collected packages: starlette, fastapi

Successfully installed fastapi-0.116.1 starlette-0.47.2


In [139]:
pip install uvicorn 

Collecting uvicorn
Note: you may need to restart the kernel to use updated packages.

  Downloading uvicorn-0.35.0-py3-none-any.whl.metadata (6.5 kB)
Downloading uvicorn-0.35.0-py3-none-any.whl (66 kB)
Installing collected packages: uvicorn
Successfully installed uvicorn-0.35.0


In [140]:
pip install neo4j

Collecting neo4j
Note: you may need to restart the kernel to use updated packages.

  Downloading neo4j-5.28.2-py3-none-any.whl.metadata (5.9 kB)
Downloading neo4j-5.28.2-py3-none-any.whl (313 kB)
Installing collected packages: neo4j
Successfully installed neo4j-5.28.2


To recap the merge step above: the first time I extracted the information from the dump, I set the author to unknown whenever the heading was "quote about". I then changed the extractor a little and re-extracted. This time the author is set to "About <title>" when the heading is "quote about" or "quote for". Finally, I merged the two extractions and replaced the affected authors in the new dataframe.
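
The merge-and-replace step can be sketched on toy data as follows. The final assignment and `drop` mirror the cells above, but everything else is an assumption made for illustration: the frames `old` and `new`, the choice of `quote` as the join key, and the exact condition behind `to_update`.

```python
import pandas as pd

# Toy stand-ins for the first and second extraction (hypothetical data).
old = pd.DataFrame({
    "quote": ["q1", "q2", "q3"],
    "author": ["unknown", "Albert Einstein", "unknown"],
})
new = pd.DataFrame({
    "quote": ["q1", "q3"],
    "author": ["About Albert Einstein", "unknown"],
})

# Assumption: rows are matched on the quote text. Duplicate quotes in `new`
# would multiply rows here, so they should be dropped beforehand.
merged = old.merge(
    new.rename(columns={"author": "author_new"}), on="quote", how="left"
)

# Assumption: only overwrite authors that were "unknown" in the first pass
# and for which the second pass found something more specific.
to_update = (
    merged["author"].eq("unknown")
    & merged["author_new"].notna()
    & merged["author_new"].ne("unknown")
)

merged.loc[to_update, "author"] = merged.loc[to_update, "author_new"]
result = merged.drop(columns=["author_new"]).reset_index(drop=True)
print(result)
```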

* Now the preprocessing is almost done. The only thing left to decide is the maximum length of the quote and the source.
* And the nodes and properties of the graph database.
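
One possible starting point for the graph model, purely as a sketch: the node labels `Person` and `Quote`, the relationship type `ATTRIBUTED`, and the helper `row_to_params` are all hypothetical and not a decided schema. The code only builds the Cypher text and its parameters; it does not connect to a database.

```python
# Hypothetical schema: (:Quote {text, source})-[:ATTRIBUTED {status}]->(:Person {name})
CYPHER = """
MERGE (p:Person {name: $target_name})
MERGE (q:Quote {text: $quote})
SET q.source = $source
MERGE (q)-[:ATTRIBUTED {status: $status}]->(p)
"""

def row_to_params(row: dict) -> dict:
    """Map one row of the spread CSV (as a dict) to Cypher parameters."""
    # Assumes missing sources were already replaced by "" during preprocessing.
    return {
        "quote": row["quote"],
        "source": row["source"],
        "status": row["status"],
        "target_name": row["target_name"],
    }

demo_row = {
    "quote": "E=mc²",
    "source": "",
    "author": "Albert Einstein",
    "heading_context": "1900s",
    "status": "canonical",
    "target_name": "Albert Einstein",
}
params = row_to_params(demo_row)
print(params)
```

Putting `status` on the relationship rather than on the quote would let the same quote be linked to one person as "misattributed" and to another as "canonical", if that turns out to be needed.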