# 1. Setup

In [1]:
# import libraries
import numpy as np
import pandas as pd
import re
from google.colab import drive

In [2]:
pd.set_option('display.max_colwidth', None)

In [3]:
# mount Colab to Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# verify data exists in Google Drive dir
!ls 'drive/My Drive/W266'

 reddit_database_cleaned.csv   'Untitled document.gdoc'
 reddit_database.csv	        W266_Final_Project_Data_Cleanup.ipynb
 reddit_subset_cleaned.csv      W266_Final_Project_Pegasus_Ray.ipynb
 reddit_subset_cleaned.gsheet


# 2. Load Data

In [5]:
# load data
df = pd.read_csv('drive/My Drive/W266/reddit_database.csv', usecols=['title', 'post'])
df.head(3)

Unnamed: 0,title,post
0,YouTube's traffic data for music questioned,
1,November Sees Number of U.S. Videos Viewed Online Surpass 30 Billion for First Time on Record [comScore],
2,So what do you guys all do related to analytics? Why the interest?,"There's a lot of reasons to want to know all this stuff, so I figured I'd get to know the others that are on this subreddit.\n\nSo let's hear it: Webmasters? Coders? Marketers? Work for an analytics software company? You get the idea."


In [6]:
df = df[['title', 'post']]
df.head(5)

Unnamed: 0,title,post
0,YouTube's traffic data for music questioned,
1,November Sees Number of U.S. Videos Viewed Online Surpass 30 Billion for First Time on Record [comScore],
2,So what do you guys all do related to analytics? Why the interest?,"There's a lot of reasons to want to know all this stuff, so I figured I'd get to know the others that are on this subreddit.\n\nSo let's hear it: Webmasters? Coders? Marketers? Work for an analytics software company? You get the idea."
3,10 Web Analytics Tools For Tracking Your Visitors,
4,Improving Your Sense of Site,


In [7]:
#df.shape
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545427 entries, 0 to 545426
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   title   545427 non-null  object
 1   post    274209 non-null  object
dtypes: object(2)
memory usage: 8.3+ MB


# 3. Data Cleansing

In [8]:
# drop NaNs from the title and post
df_cleaned = df.dropna(subset=['title', 'post'])
df_cleaned.head(5)

Unnamed: 0,title,post
2,So what do you guys all do related to analytics? Why the interest?,"There's a lot of reasons to want to know all this stuff, so I figured I'd get to know the others that are on this subreddit.\n\nSo let's hear it: Webmasters? Coders? Marketers? Work for an analytics software company? You get the idea."
5,"Google's Invasive, non-Anonymized Ad Targeting: A Quick Confirmation of previously suspected privacy issues","I'm cross posting this from /r/cyberlaw, hopefully you guys find it as interesting as I did(it deals with Google Analytics):\n\nSo quite awhile ago, I ordered a Papa John's pizza online. My job largely involves looking at ads that appear online, so afterwards I was quick to notice *I was getting a LOT* of Papa Johns ads (especially at night) being served through a Google owned company (DoubleClick media). Yesterday one of these ads popped up again on Youtube (a place that typically serves using the adwords program, not doubleclick), so I decided to copy the URL. \n\nFor those not in the advertising field: Making full use of Google's analytics tool means that certain information about the advertising campaign is leaked in the URL.\n\nSo let's break it apart: \n\n&gt;http://ad.doubleclick.net/click;h=(junk here);~sscs=?http://googleads.g.doubleclick.net/aclk?sa=l&amp;ai=(junk here)&amp;adurl=http://www.papajohns.com/index.shtm?utm_source=googlenetwork&amp;utm_medium=DisplayCPC&amp;utm_campaign=GoogleRemarketing\n\nFirst off, we see ~sscs: ~sscs is doubleclick's redirect variable. So rather than directly serving adwords ads, they overrode it to serve through doubleclick, then redirect through what would otherwise be an adwords link(http://googleads.g.doubleclick.net). This is tighter integration than is generally seen with adwords/doubleclick.\n\n* The interesting part is the end variables utm_source=**googlenetwork**&amp;utm_medium=**DisplayCPC**&amp;utm_campaign=**GoogleRemarketing**\n\n* DisplayCPC/googlenetwork - Confirmation that doubleclick is now more finely integrated with adwords.\n\n* ""GoogleRemarketing"", huh? 
Let's take a look at the definition for ""Remarketing""\n\n&gt;Using past campaign information to target a particular message to an audience.\n\nWhile in the past behavioral targetting has largely been based on the sum of your use, this is an interesting(though no doubt more widespread than is known) change in that; explicitly targeting old customers though a *massive* network of sites.\n\n-----------------------------------\n\nJust thought I'd put this out there. I'm sure it's not new to a lot of people, but at least to me it was interesting to see concepts like this actually put into practice on such a large scale. \n\n-----------------------------\n\nPS: I did a quick survey across several thousand domains, and for the record: right now, the most common external resource locations on the internet are(Google owned is bolded):\n\n**www.google-analytics.com**\n\n**pagead2.googlesyndication.com**\n\n**googleads.g.doubleclick.net**\n\nedge.quantserve.com\n\n**ad.doubleclick.net**\n\n**www.youtube.com**\n\nb.scorecardresearch.com\n\ns0.2mdn.net\n\ndg.specificclick.net\n\nview.atdmt.com\n\n**www.google.com**\n\n**ajax.googleapis.com**\n\n**partner.googleadservices.com**\n\nThat's a lot of data."
62,"DotCed - Functional Web Analytics - Tagging, Reporting, Analysis, and Strategy - www.dotced.com","DotCed,a Functional Analytics Consultant, offering Google Analytics Tagging, Reporting, Analysis, Strategy, SEO Auditing, and SEM Optimization. Call 919-404-9233 for a 15 min consultation."
64,Program Details - Data Analytics Course,Here is the program details of the data analytics certification course at the Academy for Decision Science Ahmedabad.
65,potential job in web analytics... need to analyze some data. what are they looking for?,"i decided grad school (physics) was not for me and i am branching out into the job market. a web analytics place is interested in me (and i'm interested in any kind of data analysis). ""The exercise is to use a comparison of three or more months of data to prepare a 5 to 10 slide PowerPoint presentation of any significant information about site visitors, what they are doing, how they arrive at our site that we could use to improve site performance as an acquisition source."" he said i should 'tell a story'. this is a field i am unfamiliar with so i'm looking for any basic tips, common pitfalls, and expectations. thanks. (i am quite familiar with data analysis in general)"


In [9]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 274209 entries, 2 to 545425
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   title   274209 non-null  object
 1   post    274209 non-null  object
dtypes: object(2)
memory usage: 6.3+ MB


In [10]:
# confirm that no null posts remain; build the mask from df_cleaned so its index matches
df_nulls = df_cleaned['post'].isnull()
df_cleaned[df_nulls]


Unnamed: 0,title,post


In [11]:
# pre-inspection
df_cleaned[:5]

Unnamed: 0,title,post
2,So what do you guys all do related to analytics? Why the interest?,"There's a lot of reasons to want to know all this stuff, so I figured I'd get to know the others that are on this subreddit.\n\nSo let's hear it: Webmasters? Coders? Marketers? Work for an analytics software company? You get the idea."
5,"Google's Invasive, non-Anonymized Ad Targeting: A Quick Confirmation of previously suspected privacy issues","I'm cross posting this from /r/cyberlaw, hopefully you guys find it as interesting as I did(it deals with Google Analytics):\n\nSo quite awhile ago, I ordered a Papa John's pizza online. My job largely involves looking at ads that appear online, so afterwards I was quick to notice *I was getting a LOT* of Papa Johns ads (especially at night) being served through a Google owned company (DoubleClick media). Yesterday one of these ads popped up again on Youtube (a place that typically serves using the adwords program, not doubleclick), so I decided to copy the URL. \n\nFor those not in the advertising field: Making full use of Google's analytics tool means that certain information about the advertising campaign is leaked in the URL.\n\nSo let's break it apart: \n\n&gt;http://ad.doubleclick.net/click;h=(junk here);~sscs=?http://googleads.g.doubleclick.net/aclk?sa=l&amp;ai=(junk here)&amp;adurl=http://www.papajohns.com/index.shtm?utm_source=googlenetwork&amp;utm_medium=DisplayCPC&amp;utm_campaign=GoogleRemarketing\n\nFirst off, we see ~sscs: ~sscs is doubleclick's redirect variable. So rather than directly serving adwords ads, they overrode it to serve through doubleclick, then redirect through what would otherwise be an adwords link(http://googleads.g.doubleclick.net). This is tighter integration than is generally seen with adwords/doubleclick.\n\n* The interesting part is the end variables utm_source=**googlenetwork**&amp;utm_medium=**DisplayCPC**&amp;utm_campaign=**GoogleRemarketing**\n\n* DisplayCPC/googlenetwork - Confirmation that doubleclick is now more finely integrated with adwords.\n\n* ""GoogleRemarketing"", huh? 
Let's take a look at the definition for ""Remarketing""\n\n&gt;Using past campaign information to target a particular message to an audience.\n\nWhile in the past behavioral targetting has largely been based on the sum of your use, this is an interesting(though no doubt more widespread than is known) change in that; explicitly targeting old customers though a *massive* network of sites.\n\n-----------------------------------\n\nJust thought I'd put this out there. I'm sure it's not new to a lot of people, but at least to me it was interesting to see concepts like this actually put into practice on such a large scale. \n\n-----------------------------\n\nPS: I did a quick survey across several thousand domains, and for the record: right now, the most common external resource locations on the internet are(Google owned is bolded):\n\n**www.google-analytics.com**\n\n**pagead2.googlesyndication.com**\n\n**googleads.g.doubleclick.net**\n\nedge.quantserve.com\n\n**ad.doubleclick.net**\n\n**www.youtube.com**\n\nb.scorecardresearch.com\n\ns0.2mdn.net\n\ndg.specificclick.net\n\nview.atdmt.com\n\n**www.google.com**\n\n**ajax.googleapis.com**\n\n**partner.googleadservices.com**\n\nThat's a lot of data."
62,"DotCed - Functional Web Analytics - Tagging, Reporting, Analysis, and Strategy - www.dotced.com","DotCed,a Functional Analytics Consultant, offering Google Analytics Tagging, Reporting, Analysis, Strategy, SEO Auditing, and SEM Optimization. Call 919-404-9233 for a 15 min consultation."
64,Program Details - Data Analytics Course,Here is the program details of the data analytics certification course at the Academy for Decision Science Ahmedabad.
65,potential job in web analytics... need to analyze some data. what are they looking for?,"i decided grad school (physics) was not for me and i am branching out into the job market. a web analytics place is interested in me (and i'm interested in any kind of data analysis). ""The exercise is to use a comparison of three or more months of data to prepare a 5 to 10 slide PowerPoint presentation of any significant information about site visitors, what they are doing, how they arrive at our site that we could use to improve site performance as an acquisition source."" he said i should 'tell a story'. this is a field i am unfamiliar with so i'm looking for any basic tips, common pitfalls, and expectations. thanks. (i am quite familiar with data analysis in general)"


In [12]:
# contraction rules
contraction_mapping = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",  # General rule for contractions ending in "n't"
    "'re": " are",
    "'s": " is",
    "'d": " would",
    "'ll": " will",
    "'t": " not",
    "'ve": " have",
    "'m": " am"
    # Add more contractions and their expansions as needed
}

# contraction function
def expand_contractions(text, contraction_mapping):
    # one alternation over all keys, matched case-insensitively; keys are escaped in case new ones contain regex characters
    contractions_pattern = re.compile('({})'.format('|'.join(map(re.escape, contraction_mapping))), flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        key = contraction.group(0).lower()
        # look up the lowercased match; otherwise fall back to the key with its last character replaced by 't'
        return contraction_mapping.get(key) or contraction_mapping.get(key[:-1] + 't')

    expanded_text = contractions_pattern.sub(expand_match, text)
    # drop any apostrophes that were not part of a mapped contraction
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

In [13]:
# apply the function to the 'post' column
df_cleaned['post'] = df_cleaned['post'].apply(lambda x: expand_contractions(x, contraction_mapping))
# apply the function to the 'title' column
df_cleaned['title'] = df_cleaned['title'].apply(lambda x: expand_contractions(x, contraction_mapping))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['post'] = df_cleaned['post'].apply(lambda x: expand_contractions(x, contraction_mapping))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['title'] = df_cleaned['title'].apply(lambda x: expand_contractions(x, contraction_mapping))
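The SettingWithCopyWarning above is raised because df_cleaned was derived from another DataFrame and is then assigned to. A minimal sketch of one common way to avoid it, assuming the extra memory of an explicit copy is acceptable: call .copy() when the cleaned frame is created. The toy frames raw and cleaned below are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the Reddit data (hypothetical values).
raw = pd.DataFrame({
    "title": ["a", None, "c"],
    "post": ["Can't", "y", None],
})

# .copy() makes the result an independent DataFrame, so later column
# assignments are unambiguous and do not touch `raw`.
cleaned = raw.dropna(subset=["title", "post"]).copy()
cleaned["post"] = cleaned["post"].str.lower()

print(cleaned)
```

The original frame is left unchanged, which also makes it safe to re-run later cells against it.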


In [14]:
# inspect after expanding contractions
df_cleaned[:5]

Unnamed: 0,title,post
2,So what do you guys all do related to analytics? Why the interest?,"There is a lot of reasons to want to know all this stuff, so I figured I would get to know the others that are on this subreddit.\n\nSo let is hear it: Webmasters? Coders? Marketers? Work for an analytics software company? You get the idea."
5,"Google is Invasive, non-Anonymized Ad Targeting: A Quick Confirmation of previously suspected privacy issues","I am cross posting this from /r/cyberlaw, hopefully you guys find it as interesting as I did(it deals with Google Analytics):\n\nSo quite awhile ago, I ordered a Papa John is pizza online. My job largely involves looking at ads that appear online, so afterwards I was quick to notice *I was getting a LOT* of Papa Johns ads (especially at night) being served through a Google owned company (DoubleClick media). Yesterday one of these ads popped up again on Youtube (a place that typically serves using the adwords program, not doubleclick), so I decided to copy the URL. \n\nFor those not in the advertising field: Making full use of Google is analytics tool means that certain information about the advertising campaign is leaked in the URL.\n\nSo let is break it apart: \n\n&gt;http://ad.doubleclick.net/click;h=(junk here);~sscs=?http://googleads.g.doubleclick.net/aclk?sa=l&amp;ai=(junk here)&amp;adurl=http://www.papajohns.com/index.shtm?utm_source=googlenetwork&amp;utm_medium=DisplayCPC&amp;utm_campaign=GoogleRemarketing\n\nFirst off, we see ~sscs: ~sscs is doubleclick is redirect variable. So rather than directly serving adwords ads, they overrode it to serve through doubleclick, then redirect through what would otherwise be an adwords link(http://googleads.g.doubleclick.net). This is tighter integration than is generally seen with adwords/doubleclick.\n\n* The interesting part is the end variables utm_source=**googlenetwork**&amp;utm_medium=**DisplayCPC**&amp;utm_campaign=**GoogleRemarketing**\n\n* DisplayCPC/googlenetwork - Confirmation that doubleclick is now more finely integrated with adwords.\n\n* ""GoogleRemarketing"", huh? 
Let is take a look at the definition for ""Remarketing""\n\n&gt;Using past campaign information to target a particular message to an audience.\n\nWhile in the past behavioral targetting has largely been based on the sum of your use, this is an interesting(though no doubt more widespread than is known) change in that; explicitly targeting old customers though a *massive* network of sites.\n\n-----------------------------------\n\nJust thought I would put this out there. I am sure it is not new to a lot of people, but at least to me it was interesting to see concepts like this actually put into practice on such a large scale. \n\n-----------------------------\n\nPS: I did a quick survey across several thousand domains, and for the record: right now, the most common external resource locations on the internet are(Google owned is bolded):\n\n**www.google-analytics.com**\n\n**pagead2.googlesyndication.com**\n\n**googleads.g.doubleclick.net**\n\nedge.quantserve.com\n\n**ad.doubleclick.net**\n\n**www.youtube.com**\n\nb.scorecardresearch.com\n\ns0.2mdn.net\n\ndg.specificclick.net\n\nview.atdmt.com\n\n**www.google.com**\n\n**ajax.googleapis.com**\n\n**partner.googleadservices.com**\n\nThat is a lot of data."
62,"DotCed - Functional Web Analytics - Tagging, Reporting, Analysis, and Strategy - www.dotced.com","DotCed,a Functional Analytics Consultant, offering Google Analytics Tagging, Reporting, Analysis, Strategy, SEO Auditing, and SEM Optimization. Call 919-404-9233 for a 15 min consultation."
64,Program Details - Data Analytics Course,Here is the program details of the data analytics certification course at the Academy for Decision Science Ahmedabad.
65,potential job in web analytics... need to analyze some data. what are they looking for?,"i decided grad school (physics) was not for me and i am branching out into the job market. a web analytics place is interested in me (and i am interested in any kind of data analysis). ""The exercise is to use a comparison of three or more months of data to prepare a 5 to 10 slide PowerPoint presentation of any significant information about site visitors, what they are doing, how they arrive at our site that we could use to improve site performance as an acquisition source."" he said i should notell a story. this is a field i am unfamiliar with so i am looking for any basic tips, common pitfalls, and expectations. thanks. (i am quite familiar with data analysis in general)"
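Two side effects are visible in the output above: possessives are expanded as if they were contractions ("Google's" becomes "Google is"), and an opening quote followed by "t" is treated as a contraction ("'tell a story'" becomes "notell a story"). The following is a hedged sketch of a stricter variant, not the notebook's method. It assumes that dropping the ambiguous "'s" and "'t" rules is acceptable, and it only expands a suffix when it follows a letter and ends a word. The names STRICT_CONTRACTIONS, STRICT_PATTERN and expand_contractions_strict are hypothetical.

```python
import re

# Reduced mapping (assumption): the ambiguous "'s" and "'t" rules are left out.
STRICT_CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'d": " would",
    "'ll": " will",
    "'ve": " have",
    "'m": " am",
}

# Whole-word forms must stand alone; suffix forms must follow a letter
# and be followed by a word boundary.
STRICT_PATTERN = re.compile(
    r"\b(can't|won't)\b|(?<=[a-z])(n't|'re|'d|'ll|'ve|'m)\b",
    flags=re.IGNORECASE,
)

def expand_contractions_strict(text):
    """Expand only unambiguous contractions; leave everything else as is."""
    return STRICT_PATTERN.sub(
        lambda m: STRICT_CONTRACTIONS.get(m.group(0).lower(), m.group(0)),
        text,
    )

print(expand_contractions_strict("Google's tool doesn't work"))
print(expand_contractions_strict("he said i should 'tell a story'"))
```

Possessives and quoted words pass through unchanged here; their apostrophes would still be removed by the later punctuation step.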


In [15]:
# text normalization
def text_normalization(text):
  # Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove URLs
  text = re.sub(r'https?://\S+|www\.\S+', '', text)
  # Replace newlines and carriage returns with spaces
  text = re.sub(r'[\r\n]', ' ', text)
  # Replace tabs with spaces
  text = re.sub(r'\t', ' ', text)
  # Remove numeric sequences
  text = re.sub(r'\d+', '', text)
  # Remove punctuation/special characters
  text = re.sub(r'[^\w\s]', '', text)
  # Remove parentheses
  #text = re.sub(r'[()]', '', text)
  # convert to lowercase
  text = text.lower()
  # remove extra spaces
  text = re.sub(r' +', ' ', text.strip())

  return text

# apply text normalization
df_cleaned['post'] = df_cleaned['post'].apply(text_normalization)
df_cleaned['title'] = df_cleaned['title'].apply(text_normalization)
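Because text_normalization deletes punctuation without leaving a space, words separated only by punctuation are fused (for example "did(it" becomes "didit"), and HTML entities such as &amp;gt; and &amp;amp; survive as the tokens "gt" and "amp". A minimal sketch of a variant that avoids both, assuming it is acceptable to decode entities with the standard library's html.unescape and to replace removed characters with spaces. The name text_normalization_spaced is hypothetical and this is not the function used to produce the outputs in this notebook.

```python
import html
import re

def text_normalization_spaced(text):
    """Variant of the normalization step that substitutes spaces (sketch)."""
    # Strip literal HTML tags and URLs first, while entities are still escaped
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Decode entities such as &amp; and &gt; into the characters they stand for
    text = html.unescape(text)
    # Replace digits and punctuation with spaces so neighbouring words stay apart
    text = re.sub(r'\d+', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    text = text.lower()
    # Collapse all whitespace (newlines, tabs, repeated spaces) to single spaces
    return re.sub(r'\s+', ' ', text).strip()

print(text_normalization_spaced("I did(it deals with A&amp;B):\n\n&gt;quote 15 min"))
```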


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['post'] = df_cleaned['post'].apply(text_normalization)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['title'] = df_cleaned['title'].apply(text_normalization)


Calculate the average number of words per Reddit post.

In [16]:
# average words
df_cleaned['word_count'] = df_cleaned['post'].str.split().str.len()
avg_word_count = df_cleaned['word_count'].mean()
print(f"Average words in Reddit Post column: {avg_word_count:.2f}")

Average words in Reddit Post column: 113.56


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned['word_count'] = df_cleaned['post'].str.split().str.len()


Keep only Reddit posts whose word count is at or above the average, so that the dataset has more text to train on. This also filters out empty titles and posts.

In [17]:
post_lengths = df_cleaned['post'].str.len()
title_lengths = df_cleaned['title'].str.len()
word_counts = df_cleaned['post'].str.split().str.len()

df_cleaned = df_cleaned[(post_lengths >= 1) & (title_lengths >= 1) & (word_counts >= avg_word_count)]

# Reset the index
df_cleaned = df_cleaned.reset_index(drop=True)
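The filter above recomputes the word count even though a word_count column already exists, and the check for a non-empty post is implied whenever the threshold is positive. A small sketch of the same idea that reuses the stored column, on a made-up frame named demo (not the Reddit data):

```python
import pandas as pd

# Hypothetical toy data: one post below the average and one empty title.
demo = pd.DataFrame({
    "title": ["t1", "t2", "", "t4"],
    "post": [
        "one two three four",
        "one",
        "one two three four five six",
        "one two three four five",
    ],
})

demo["word_count"] = demo["post"].str.split().str.len()
threshold = demo["word_count"].mean()

# Keep rows at or above the average word count that also have a non-empty title.
mask = (demo["word_count"] >= threshold) & (demo["title"].str.len() >= 1)
filtered = demo[mask].reset_index(drop=True)

print(filtered)
```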

In [18]:
# post-inspection
df_cleaned[:5]

Unnamed: 0,title,post,word_count
0,google is invasive nonanonymized ad targeting a quick confirmation of previously suspected privacy issues,i am cross posting this from rcyberlaw hopefully you guys find it as interesting as i didit deals with google analytics so quite awhile ago i ordered a papa john is pizza online my job largely involves looking at ads that appear online so afterwards i was quick to notice i was getting a lot of papa johns ads especially at night being served through a google owned company doubleclick media yesterday one of these ads popped up again on youtube a place that typically serves using the adwords program not doubleclick so i decided to copy the url for those not in the advertising field making full use of google is analytics tool means that certain information about the advertising campaign is leaked in the url so let is break it apart gt heresscs hereampadurl first off we see sscs sscs is doubleclick is redirect variable so rather than directly serving adwords ads they overrode it to serve through doubleclick then redirect through what would otherwise be an adwords link this is tighter integration than is generally seen with adwordsdoubleclick the interesting part is the end variables utm_sourcegooglenetworkamputm_mediumdisplaycpcamputm_campaigngoogleremarketing displaycpcgooglenetwork confirmation that doubleclick is now more finely integrated with adwords googleremarketing huh let is take a look at the definition for remarketing gtusing past campaign information to target a particular message to an audience while in the past behavioral targetting has largely been based on the sum of your use this is an interestingthough no doubt more widespread than is known change in that explicitly targeting old customers though a massive network of sites just thought i would put this out there i am sure it is not new to a lot of people but at least to me it was interesting to see concepts like this actually put into practice on such a large scale ps i did a quick survey across 
several thousand domains and for the record right now the most common external resource locations on the internet aregoogle owned is bolded pageadgooglesyndicationcom googleadsgdoubleclicknet edgequantservecom addoubleclicknet bscorecardresearchcom smdnnet dgspecificclicknet viewatdmtcom ajaxgoogleapiscom partnergoogleadservicescom that is a lot of data,351
1,potential job in web analytics need to analyze some data what are they looking for,i decided grad school physics was not for me and i am branching out into the job market a web analytics place is interested in me and i am interested in any kind of data analysis the exercise is to use a comparison of three or more months of data to prepare a to slide powerpoint presentation of any significant information about site visitors what they are doing how they arrive at our site that we could use to improve site performance as an acquisition source he said i should notell a story this is a field i am unfamiliar with so i am looking for any basic tips common pitfalls and expectations thanks i am quite familiar with data analysis in general,123
2,how to identify which google analytics account is tracking my site,hey all my gf is having trouble with ga and has not gotten any response in days from posting in the google help forums so i figure i would try here question as follows i have a client that we coded a website for and google analytics was plugged into it we would like to look at the statistics for the site and no one can identify what account is associated with the tracking code that is embedded i have pulled the user account number from the source code i am just not sure how to identify what the login associated with it is can anyone help this is a fairly urgent request thanks in advance for any help,119
3,ga how can i track clicks on a single link on a transition page hosted on another serverdomain,i have a client that wants to know the ctr on one of the ads they have running my mailing i have told them it would be simplest if they just monitored all incoming traffic from my domain in their analytics but i think they are technologically retarded so they have asked us to provide the number of clicks for them here is my problem the adtransition page only shows up for clicks from my enewsletter and each link is built in html with a different url they all begin the same but link to different articles or sites so the code is different for each one i pasted the ga code into each transition page but i cannot seem to find the number of times a link is clicked on those pages and then it leads to a second problem i would have to check each piece of content each day to check for links there has to be an easier way the internet proved no luck affiliate marketer friends had no knowledge so i am bringing the question to you is there a way to track clicks of a single url leading away from multiple pages that my code is in and if so tell me tell me tell me,212
4,google analytics question tracking multiple domains,i have a traffic network of several different sites and i would like to compare data like bounce rate time on site similar goals etc across these sites which are all set up under the same ga account is there a way i can compare these different data from different profiles in one combined profile so i can avoid jumping from one profile to another to compare this data i was looking into choosing the setting tracking multiple domains but that does not seem give the feature to compare data between different sites but more track users across several domains any tips or links to a solution to this problem would be greatly appreciated,114


In [19]:
df_cleaned.tail(5)

Unnamed: 0,title,post,word_count
95633,interpretation of coxph model,i have started fitting coxph models at uni and my supervisor is questioning whether the model is handling clustering properly in our timedependentcovariate cox model the output i am getting is gt sfit coxphsurvtstarttstop endptsbp clusterpatid data sdata_subsetxt gt summarysfit call coxphformula survtstart tstop endpt sbp data sdata_subset x t cluster patid n number of events coef expcoef secoef robust se z prgtz sbp lte signif codes expcoef expcoef lower upper sbp concordance se likelihood ratio test on df plte wald test on df plte score logrank test on df plte robust plte note the likelihood ratio and score tests assume independence of observations within a cluster the wald and robust score tests do not the issue my supervisor has is that n is not the number of patients i believe instead that this is the number of rows in the datasetmultiple rows per patient i am struggling to find any evidence of this anywhere it might help if i can find the number of cluster levels within the model object is there a way of doing this,178
95634,how do i convert a timeseries date from a character into something quantifiable,hey guys i want to create an object that repeats and counts itself per day from the column survey creation date so to say on i have observations beeps eg and on i have beeps and on i have beeps and na is or so nd i also want to create a dayvar object which repeats and counts the days to say is is and so on ampxb it would be also easier if i could just subset the i have created a list for you to replicate i am thankful for every help i can get structurelistsurvey creation date c survey completion date c nd i also want to create a dayvar object that repeats and counts the days to say is is etchave observations beeps eg and on i have beeps and on i have beeps and na is or soo since your last survey how many alcoholic drinks have you had c i feel comfortable in my current location c i feel stressed c i feel downdepressed c rownames cna l class ctbl_df tbl dataframe,178
95635,two linear regression lines through specific data points and crossing point,hi everyone for a university assignment we are tasked to create a graph of our data and create two regression lines that cross each other to determine the turning point of our data my plan would be to create the graph with ggplot right now i am struggling how to insert the two lines the least dumb way but i have also no idea how i would determine the crossing point without pointing on my screen and saying there data is this csv tempconduc my last attempt now has been ggplotpit aesx temp y conduc geom_point geom_smoothmethod lm formula unlistdfunlistdf geom_smoothmethod lm formula unlistdfunlistdf but this just gives me an error message that i may understand where it comes from but gives me no clue how to further proceed warning message computation failed in stat_smooth variable lengths differ found for weights the desired graph would look something like this without the dashed line that was just my old redacted version thank you everyone in advance,165
95636,help interpretting lmer model output,hello i am wonder how the following output would be interpreted i ran a piecewise linear mixed effect model with fixed effect time predictors each time variable represents a segment of time the full length of time is units time is coded c etc time is cetc and time is coded c the dependent variable is an logx transformation to my knowledge the random effects are the betweengroup effects and the fixed effects are withingroup with group being soccodef i am not sure how to interpret the effects though help linear mixed model fit by maximum likelihood ttests use satterthwaite is method lmermodlmertest formula logpint time itime time time soccodef data onet aic bic loglik deviance dfresid scaled residuals min q median q max random effects groups name variance stddev soccodef intercept residual number of obs groups soccodef fixed effects estimate std error df t value prgtt intercept e e e lt e time e e e lt e itime e e e lt e time e e e e time e e e signif codes correlation of fixed effects intr time it time time itime time time,188
95637,print only loadings in factanal,hi everybody i am currently doing a factor analysis using the factanalcommand since i want to export the results to excel i want to basically get a table where all the loadings of all the items for all the factors are listed but every time i print the loadings it also prints the ss loadings the proportion variance and the cumulative variance for all the factors right below the loadings for the items i had the same problem when i did a principal component analysis and solved it there by only printing and exporting the weights instead of the loadings but factanal does not seem to have such a component does anybody know how i could fix that,118


In [20]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95638 entries, 0 to 95637
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   title       95638 non-null  object
 1   post        95638 non-null  object
 2   word_count  95638 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 2.2+ MB


In [21]:
# drop duplicates
df_cleaned = df_cleaned.drop_duplicates()

In [22]:
# shuffle dataframe
df_shuffled = df_cleaned.sample(frac=1, random_state=2).reset_index(drop=True)

# 4. Export Data
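Export the first 25K rows of the cleaned-up data to CSV. That is roughly 10% of the 274,209 posts with non-null text, or about a quarter of the 95,638 rows that passed the word-count filter.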

In [24]:
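# export from the shuffled frame so the subset is a random sample rather than the first rows in file order
df_shuffled[['title', 'post']].head(25000).to_csv('drive/My Drive/W266/reddit_subset_cleaned.csv', index=False)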
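Shuffling and truncating can also be done in one call. The sketch below, on a made-up frame named demo, draws a fixed-size random subset with DataFrame.sample and a fixed random_state so the export is reproducible. SUBSET_SIZE is a hypothetical stand-in for the 25,000 used above, and the CSV is written to a string here rather than to Google Drive.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned posts.
demo = pd.DataFrame({
    "title": [f"title {i}" for i in range(10)],
    "post": [f"post {i}" for i in range(10)],
    "word_count": list(range(10)),
})

SUBSET_SIZE = 4  # stand-in for 25000

# sample(n=...) picks rows at random without replacement; random_state fixes the draw.
subset = demo[["title", "post"]].sample(n=min(SUBSET_SIZE, len(demo)), random_state=2)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = subset.to_csv(index=False)
print(csv_text)
```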