## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch an individual cell to another language with the `%LANGUAGE` syntax (for example, `%sql`). Python, Scala, SQL, and R are all supported.

In [2]:
# File location and type
file_location = "/FileStore/tables/Try_data.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "False"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

_c0,_c1,_c2
Product,Consumer complaint narrative,Clean Text
Debt collection,transworld systems inc.,
is trying to collect a debt that is not mine,"not owed and is inaccurate.""",transworld system inc. trying collect debt mine owed inaccurate
"Credit reporting, credit repair services, or other personal consumer reports","I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act.",would like request suppression following item credit report result falling victim identity theft information relate transaction made/accounts opened attached supporting documentation attest blocked appearing credit report pursuant section 605b fair credit reporting act
Debt collection,"Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work.",past 2 week receiving excessive amount telephone call company listed complaint call occur unknown unknown cell job company right harass work want stop extremely distracting told 5 time day call collection agency work
"Credit reporting, credit repair services, or other personal consumer reports",someone used my personal information to get medical treatment that i did not authorize.i have filed a report i have tried to dispute these 2 accounts and nothing gets resolved.,someone used personal information get medical treatment authorize.i filed report tried dispute 2 account nothing get resolved
"Money transfer, virtual currency, or money service","I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal.",
This occured on XX/XX/2019,"by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX""",sold access event digitally screenshots detail transaction transferred money provided fake ticket reported paypal amount 21.00 including 1.00 fee paypal occured 2019 paypal user gave two account 1 unknown 2 unknown
Debt collection,"While checking my credit report I noticed three collections by a company called ARS that i was unfamiliar with. I disputed these collections with XXXX, and XXXX and they both replied that they contacted the creditor and the creditor verified the debt so I asked for proof which both bureaus replied that they are not required to prove anything. I then mailed a certified letter to ARS requesting proof of the debts n the form of an original aggrement, or a proof of a right to the debt, or even so much as the process as to how the bill was calculated, to which I was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after I sent a clear notice to provide me evidence. Afterwards I recontacted both XXXX, and XXXX, to redispute on the premise that it is not my debt if evidence can not be drawn up, I feel as if I am being personally victimized by ARS on my credit report for debts that are not owed to them or any party for that matter, and I feel discouraged that the credit bureaus who control many aspects of my personal finances are so negligent about my information.",checking credit report noticed three collection company called ar unfamiliar disputed collection unknown unknown replied contacted creditor creditor verified debt asked proof bureau replied required prove anything mailed certified letter ar requesting proof debt n form original aggrement proof right debt even much process bill calculated simply replied letter collection claim listed name account number amount information verify debt sent clear notice provide evidence afterwards recontacted unknown unknown redispute premise debt evidence drawn feel personally victimized ar credit report debt owed party matter feel discouraged credit bureau control many aspect personal finance negligent information
Credit card or prepaid card,On XX/XX/2018 I made a {$87.00} purchase with a Suntrust rewards credit card that was eligible for an activated Suntrust deal for 20 % cash back on a purchase from XXXX XXXX.,


In [3]:
# Create a view or table

temp_table_name = "Try_data_csv"

df.createOrReplaceTempView(temp_table_name)

In [4]:
%sql

/* Query the created temp table in a SQL cell */

select * from `Try_data_csv`

_c0,_c1,_c2
Product,Consumer complaint narrative,Clean Text
Debt collection,transworld systems inc.,
is trying to collect a debt that is not mine,"not owed and is inaccurate.""",transworld system inc. trying collect debt mine owed inaccurate
"Credit reporting, credit repair services, or other personal consumer reports","I would like to request the suppression of the following items from my credit report, which are the result of my falling victim to identity theft. This information does not relate to [ transactions that I have made/accounts that I have opened ], as the attached supporting documentation can attest. As such, it should be blocked from appearing on my credit report pursuant to section 605B of the Fair Credit Reporting Act.",would like request suppression following item credit report result falling victim identity theft information relate transaction made/accounts opened attached supporting documentation attest blocked appearing credit report pursuant section 605b fair credit reporting act
Debt collection,"Over the past 2 weeks, I have been receiving excessive amounts of telephone calls from the company listed in this complaint. The calls occur between XXXX XXXX and XXXX XXXX to my cell and at my job. The company does not have the right to harass me at work and I want this to stop. It is extremely distracting to be told 5 times a day that I have a call from this collection agency while at work.",past 2 week receiving excessive amount telephone call company listed complaint call occur unknown unknown cell job company right harass work want stop extremely distracting told 5 time day call collection agency work
"Credit reporting, credit repair services, or other personal consumer reports",someone used my personal information to get medical treatment that i did not authorize.i have filed a report i have tried to dispute these 2 accounts and nothing gets resolved.,someone used personal information get medical treatment authorize.i filed report tried dispute 2 account nothing get resolved
"Money transfer, virtual currency, or money service","I was sold access to an event digitally, of which I have all the screenshots to detail the transactions, transferred the money and was provided with only a fake of a ticket. I have reported this to paypal and it was for the amount of {$21.00} including a {$1.00} fee from paypal.",
This occured on XX/XX/2019,"by paypal user who gave two accounts : 1 ) XXXX 2 ) XXXX XXXX""",sold access event digitally screenshots detail transaction transferred money provided fake ticket reported paypal amount 21.00 including 1.00 fee paypal occured 2019 paypal user gave two account 1 unknown 2 unknown
Debt collection,"While checking my credit report I noticed three collections by a company called ARS that i was unfamiliar with. I disputed these collections with XXXX, and XXXX and they both replied that they contacted the creditor and the creditor verified the debt so I asked for proof which both bureaus replied that they are not required to prove anything. I then mailed a certified letter to ARS requesting proof of the debts n the form of an original aggrement, or a proof of a right to the debt, or even so much as the process as to how the bill was calculated, to which I was simply replied a letter for each collection claim that listed my name an account number and an amount with no other information to verify the debts after I sent a clear notice to provide me evidence. Afterwards I recontacted both XXXX, and XXXX, to redispute on the premise that it is not my debt if evidence can not be drawn up, I feel as if I am being personally victimized by ARS on my credit report for debts that are not owed to them or any party for that matter, and I feel discouraged that the credit bureaus who control many aspects of my personal finances are so negligent about my information.",checking credit report noticed three collection company called ar unfamiliar disputed collection unknown unknown replied contacted creditor creditor verified debt asked proof bureau replied required prove anything mailed certified letter ar requesting proof debt n form original aggrement proof right debt even much process bill calculated simply replied letter collection claim listed name account number amount information verify debt sent clear notice provide evidence afterwards recontacted unknown unknown redispute premise debt evidence drawn feel personally victimized ar credit report debt owed party matter feel discouraged credit bureau control many aspect personal finance negligent information
Credit card or prepaid card,On XX/XX/2018 I made a {$87.00} purchase with a Suntrust rewards credit card that was eligible for an activated Suntrust deal for 20 % cash back on a purchase from XXXX XXXX.,


In [5]:
# With this registered as a temp view, it will only be available to this particular notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts as well as allow various users across different notebooks to query this data.
# To do so, choose your table name and uncomment the bottom line.

permanent_table_name = "Try_data_csv"

# df.write.format("parquet").saveAsTable(permanent_table_name)

In [6]:
df = df.sample(withReplacement=False, fraction=0.1, seed=0)

In [7]:
display(df)

_c0,_c1,_c2
IT SEEMS XXXX IS THE ONE INITIATING THIS AND THEN TELLING THE OTHER BUREAUS TO FOLLOW THEIR LEAD.,,
During the subsequent document collection period,the loan officer asked we could put down another half percent,which would increase our down payment to 3.5 % of the sales price. We agreed that we could afford the extra 0.5 %
On XX/XX/XXXX,we were introduced via email to a loan processor ( herein after loan processor or underwriter ; this introduction came directly from the loan processor and the loan officer was not involved in the introduction. At this time,the loan processor stated that our loan was conditionally approved
On XX/XX/XXXX,we received an email from the loan officer informing us that XXXX had acquired Closing Mark Home Loans ( herein after CMHL ) and that our mortgage application would be acquired by CMHL. The email stated that our mortgage would be redisclosed and that we would need to sign updated disclosures.,
On XX/XX/XXXX,we received 3 options for the interest rate lock. The option sheet was sent only with instructions to look at Options 1,2 and 3
On XX/XX/XXXX,we emailed our loan officer to ask about what name to put on the cashiers check for closing,as we never received closing instructions. While this email was sent on a Saturday
We received a formal loan denial after our closing,with the denial letters being sent on XX/XX/XXXX. Given that XX/XX/XXXX was a holiday,it seems likely that the letters were sent so as to not be received until after closing.
Although I am checking for and addressing missing and or deficient aspects of REPORTING COMPLIANCES and not contesting any debt of compliant nature,I should make you aware that since unlawful reporting transitions collection into an equally not complaint circumstance. Being still yet not validated by document fact in compliance to requisite standards,it is to be announced yet again that legally I have no knowledge of the validity of the alleged claims of delinquency and or derogatory nature
"Credit reporting, credit repair services, or other personal consumer reports",The credit bureaus are reporting inaccurate/outdated/incomplete personal information.,credit bureau reporting inaccurate/outdated/incomplete personal information
"Credit reporting, credit repair services, or other personal consumer reports",2ND NOTICE OF PENDING LITIGATION SEEKING RELIEF AND MONETARY DAMAGES UNDER FCRA SECTION 616 & SECTION 617 EXPERIAN HAS INQUIRES ON MY CREDIT REPORT THAT ARE : MINE/INACCURATE.FRAUD. I DID NOT GIVE NY PERMISSION TO REPORT THESE INQUIRES. I HAVE SUBMITTED A POLICE REPORT SEVERAL TIMES AND EXPERIAN HAS NOT OR REFUSED TO DELETE AND REMOVE THESE INQUIRES FROM MY CREDIT REPORT. I WAS NOT NOTIFIED THESE INQUIRES WAS OR WILL BE REPORTED TO MY CREDIT REPORT. INQUIRES NOT MINE : XXXX XXXX Inquiry from XX/XX/XXXX Credit Unions XXXX Inquiry from XX/XX/XXXX Personal Loans Cos.,


In [8]:
from pyspark.sql.functions import col, lower, regexp_replace, split

def clean_text(c):
  c = lower(c)
  c = regexp_replace(c, "[^a-zA-Z0-9\\s]", "")
  c = regexp_replace(c, "xxxx", "")
  c = split(c, "\\s+")
  #c = [a for a in c if len(a)>2]
  #c = ' '.join(c)
  return c

clean_text_df = df.select(clean_text(col("_c1")).alias("text"))

clean_text_df.printSchema()
clean_data = clean_text_df.where(col("text").isNotNull())
clean_data.show(10)
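
To see what the three steps in `clean_text` do to a single string, here is a minimal plain-Python sketch that mirrors them with the standard `re` module. The name `clean_text_py` and the sample sentences are hypothetical and exist only for illustration; this is not Spark code.

```python
import re

def clean_text_py(text):
    """Plain-Python mirror of the clean_text column expression (illustration only)."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # drop everything except letters, digits, whitespace
    text = re.sub(r"xxxx", "", text)            # drop the redaction marker
    return re.split(r"\s+", text)               # split on runs of whitespace

sample = "I disputed this with XXXX, and they replied!"
tokens = clean_text_py(sample)
print(tokens)

# A redaction at the start of the text leaves a leading space, so the split yields an empty token
print(clean_text_py("XXXX called me!"))
```

The second call shows why an empty string can end up among the tokens and may need to be filtered out afterwards.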

In [9]:
#data = df['_c1'].values
import pyspark.sql.functions as f
my_list = clean_data.select(f.collect_list('text')).first()[0]

In [10]:
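# Each collected element is itself a list of tokens, so flatten before joining
my_list = ' '.join(word for words in my_list for word in words)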
my_list

In [11]:
from pyspark.ml.feature import StopWordsRemover

# Define a list of stop words or use default list
remover = StopWordsRemover()
stopwords = remover.getStopWords() 

# Extend the default list with the redaction marker and the empty token
stopwords = stopwords + ['xxxx', '']
#stopwords

In [12]:
# Pass the extended list; otherwise only the default stop words are removed
remover = StopWordsRemover(inputCol="text", outputCol="filtered", stopWords=stopwords)
words_df = remover.transform(clean_data)

In [13]:
words_df.show(10)
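
As a rough picture of what stop word filtering does to one row, here is a small plain-Python sketch. The helper `remove_stop_words` and the token list are hypothetical. The comparison is case-insensitive, which matches the default behavior of `StopWordsRemover`.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not in stop_words (case-insensitive comparison)."""
    stop = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop]

# Hypothetical tokens, including an empty string and the redaction marker
tokens = ["", "called", "the", "Bank", "about", "xxxx", "debt"]
custom_stop_words = ["the", "about", "xxxx", ""]

filtered = remove_stop_words(tokens, custom_stop_words)
print(filtered)
```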

In [14]:
words_1 = words_df.select("filtered")

In [15]:
words_1

In [16]:
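from pyspark.sql.functions import col, concat_ws

def joining(c):
  # concat_ws joins the elements of an array<string> column with the given separator
  return concat_ws(' ', c)

words_df.select(joining(col("filtered")).alias("text1")).show()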

In [17]:
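my_clean = words_df.selectExpr("concat_ws(' ', filtered) AS text1")  # assumption: my_clean is never defined in this notebook
my_clean.show(10)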

In [18]:
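from pyspark.ml.feature import IDF

# featurizedData is created in the TF-IDF cell further down (In [38]); run that cell first.
# IDF needs a vector column, so read the HashingTF output "rawFeatures" rather than "text".
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)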

In [19]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes."),
    (2, "Logistic, regression, models,are,neat")
  
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsDataFrame = tokenizer.transform(sentenceDataFrame)
for words_label in wordsDataFrame.select("words", "label").take(3):
    print(words_label)

In [20]:
from pyspark.sql.functions import regexp_replace, trim, col, lower
from pyspark.ml.feature import StopWordsRemover

# Define a list of stop words or use default list
remover = StopWordsRemover()
stopwords = remover.getStopWords() 


def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
        
    """
    
    c = lower(column)
    c = regexp_replace(c, "[^a-zA-Z0-9\\s]", "")
    c = regexp_replace(c, "xxxx", "")
    # Strip leading and trailing spaces, as the docstring describes
    return trim(c)


sentenceDF = spark.createDataFrame([('Hi, you',),
                                    (' Look! No under_score!',),
                                    (' *      Remove punctuation then spaces  * ',)], ['sentence'])
# Display the original sentences
sentenceDF.show(truncate=False)
# Then apply the clean-up to the complaint narrative column
s = df.select(removePunctuation(col('_c1')).alias('text'))

In [21]:
df.summary().show()

In [22]:
from pyspark.sql.functions import split, explode

shakeWordsSplitDF = sentenceDF.select(split(sentenceDF.sentence, r'\s+').alias('split'))
shakeWordsSingleDF = shakeWordsSplitDF.select(explode(shakeWordsSplitDF['split']).alias('word'))
# Drop the empty strings produced by leading whitespace
shakeWordsDF = shakeWordsSingleDF.where(shakeWordsSingleDF.word != '')
shakeWordsDF.show()
shakeWordsDFCount = shakeWordsDF.count()
print(shakeWordsDFCount)

In [23]:
from pyspark.sql.functions import split, explode

# s has a single column named 'text' (see the alias in the cell that defines s)
shakeWordsSplitDF = (s
                    .select(split(s['text'], r'\s+').alias('split')))
shakeWordsSingleDF = (shakeWordsSplitDF
                    .select(explode(shakeWordsSplitDF['split']).alias('word')))
shakeWordsDF = shakeWordsSingleDF
shakeWordsDF.show()
shakeWordsDFCount = shakeWordsDF.count()
print(shakeWordsDFCount)

In [24]:
def wordCount(wordListDF):
    """Creates a DataFrame with word counts.

    Args:
        wordListDF (DataFrame of str): A DataFrame consisting of one string column called 'word'.

    Returns:
        DataFrame of (str, int): A DataFrame containing 'word' and 'count' columns.
    """
    return (wordListDF
                .groupBy('word').count())

In [25]:
from pyspark.sql.functions import desc

WordsAndCountsDF = wordCount(shakeWordsDF)
topWordsAndCountsDF = WordsAndCountsDF.orderBy(desc("count"))

topWordsAndCountsDF.show()
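
The group, count, and sort steps above follow the same idea as this plain-Python sketch built on `collections.Counter`. The word list is made up for illustration and does not come from the complaint data.

```python
from collections import Counter

# Hypothetical list of already-cleaned words
words = ["debt", "credit", "report", "credit", "debt", "credit"]

counts = Counter(words)          # same idea as groupBy('word').count()
top = counts.most_common(2)      # same idea as ordering by count, descending
print(top)  # [('credit', 3), ('debt', 2)]
```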

In [26]:
p = ['asdfasdg','asdgasg','hdsfhsdfh']

In [28]:
rdd2

In [29]:
' '.join(p)

In [30]:
from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

In [31]:
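# mllib works on RDDs: use one list of tokens per complaint
rawData = clean_data.select("text").rdd.map(lambda row: row[0])

In [32]:
# Tokens are already lower-cased by clean_text; drop empty strings and index each document
documents = rawData.map(lambda words: [w for w in words if w])
documentId = documents.zipWithIndex().map(lambda pair: pair[1])

In [33]:
hashingTF = HashingTF(100000)
tf = hashingTF.transform(documents)

In [34]:
idf = IDF(minDocFreq=1)

In [35]:
# mllib's IDF has no fit_transform: fit on the term frequencies, then transform them
tfidf = idf.fit(tf).transform(tf)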

In [36]:
documentId

In [37]:
tf
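
`HashingTF` builds term-frequency vectors with the hashing trick: each token is hashed to a bucket index, and the bucket counts how many tokens landed in it. The sketch below shows the idea in plain Python. It uses `zlib.crc32` purely as a reproducible stand-in hash, which is an assumption; Spark uses its own hash function, so the indices will not match Spark's. The helper `hashed_tf` is hypothetical.

```python
import zlib

def hashed_tf(tokens, num_features):
    """Term-frequency vector via the hashing trick, returned as a sparse {index: count} dict."""
    vec = {}
    for tok in tokens:
        # Stand-in hash (assumption): crc32 is deterministic across runs, unlike Python's hash()
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec

doc = ["credit", "report", "credit"]
vec = hashed_tf(doc, 16)
print(vec)
```

With a small `num_features`, different tokens can collide in the same bucket, which is one reason the notebook picks a large value such as 100000.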

In [38]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession
 
if __name__ == "__main__":
 
    sentenceData = spark.createDataFrame([ 
        (0.0, "Welcome to TutorialKart."), 
        (0.0, "Learn Spark at TutorialKart."), 
        (1.0, "Spark Mllib has TF-IDF.") 
    ], ["label", "sentence"]) 
 
    tokenizer = Tokenizer(inputCol="sentence", outputCol="words") 
    wordsData = tokenizer.transform(sentenceData) 
 
    hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20) 
    featurizedData = hashingTF.transform(wordsData) 
    # alternatively, CountVectorizer can also be used to get term frequency vectors
 
    idf = IDF(inputCol="rawFeatures", outputCol="features") 
    idfModel = idf.fit(featurizedData) 
    rescaledData = idfModel.transform(featurizedData) 
 
    rescaledData.select("label", "features").show() 
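
For intuition about the numbers in the `features` column, here is a minimal plain-Python sketch of the weighting, assuming the smoothed formula documented for Spark ML, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. It works on token strings instead of hashed indices, and the names `doc_freq`, `idf_weight`, and `tfidf` are hypothetical.

```python
import math

# The three example sentences, tokenized the way Tokenizer would (lower-cased, split on whitespace)
docs = [
    ["welcome", "to", "tutorialkart."],
    ["learn", "spark", "at", "tutorialkart."],
    ["spark", "mllib", "has", "tf-idf."],
]

m = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def idf_weight(term):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((m + 1) / (doc_freq[term] + 1))

def tfidf(doc):
    """TF-IDF weight per term: raw term count times the IDF weight."""
    return {t: doc.count(t) * idf_weight(t) for t in set(doc)}

print(tfidf(docs[0]))
```

Terms that occur in more documents, such as "spark" here, get a smaller weight than terms that occur in only one.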
 


In [39]:
rescaledData.select('features').collect()