## Background
Endogenous circadian clocks exist in all organisms, from cyanobacteria to vertebrates, functioning to synchronize behavior and physiology with the external environment. Light is one of the strongest zeitgebers, or external synchronizing cues, of endogenous circadian rhythms across taxa. It synchronizes circadian rhythms of basic biological processes such as sleep–wake cycles, body temperature, hormone secretion, cardiovascular function, and metabolism with the natural patterns of sunlight. The invention of artificial light has disrupted the natural photoperiodic cues used to entrain these rhythms, with consequences for humans and wildlife alike. It is now well documented that exposure to artificial light at night (ALAN) alters behavior, endocrine pathways, metabolism, and cardiovascular function, and leads to pathology. Still, it remains unclear whether the downstream effects of ALAN are related to circadian rhythm regulation. Does ALAN alter behavior and physiology and cause long-term pathology through circadian rhythm misalignment, or does it act on behavior and physiology directly? To effectively mitigate the effects of light pollution, it is important to understand the mechanisms by which exposure to dim, environmentally relevant levels of night-light affects behavior and physiology in diurnal organisms.

Birds are a useful diurnal model for answering this question, as the neuroendocrine pathways that transduce light information from sensory systems to downstream processes have been well described and are conserved across vertebrates. In birds, light stimulates non-visual photoreceptors (opsins) in the retina, the pineal gland, and the suprachiasmatic nuclei (SCN) of the hypothalamus. Stimulation by light results in the expression of the pacemaker genes Clock (Clk) and Brain and muscle Arnt-like protein-1 (Bmal1), which in turn induce expression of Period (Per) and Cryptochrome (Cry), creating an autoregulatory feedback loop in which expression of each gene oscillates in a unique 24-hour rhythm. This central clock in the brain entrains circadian oscillators in peripheral tissues, such as the liver and heart, and affects downstream physiology and behavior by regulating the synthesis and release of melatonin.

In this study, we link expression of pacemaker genes in brain tissue and plasma melatonin with behavioral (activity) circadian rhythms under ALAN. We predict that individuals exposed to ALAN for 10 days will have disrupted circadian rhythms of activity, circadian gene expression, and plasma melatonin compared to controls.

## Data collection
All procedures were carried out in accordance with National Institutes of Health guidelines and were approved by the University of Nevada, Reno Institutional Animal Care and Use Committee. Twenty-four zebra finches were housed in individual cages and habituated to a 10L:14D photoperiod for 4 weeks. Daylights turned on at 07:00 and off at 17:00. Night lights turned on at 17:00 and off at 07:00; they were provided by a 20 cm × 1.5 cm 5000K LED strip standardized at 1.5 ± 0.01 lux. Active perches within each cage continuously recorded movement on and off the perch, which is a reliable measure of circadian activity. Perch activity was recorded on a Dell Precision 5810 Tower with an Intel® Xeon® Processor (E5-1620 v3) at 3.5GHz and an AMD FirePro™ W4100 graphics card. A graphical user interface (GUI) was created in MATLAB to both record and display the activity of the birds via the active perches. Each active perch was connected to an optical end-stop so that the downward force of the bird caused the wooden perch to shift down, blocking the signal between the emitter and the receiver on the end-stop. When the bird hopped off, the perch returned to its neutral position. Each end-stop sent a 1 or 0 to the computer depending on whether the bird was on or off the active perch, respectively. Data for all of the active perches were collected approximately every 0.23 seconds and totaled for every minute of activity.

Data from each day are output into a text file, resulting in one text file per day of the experiment. The file names are in a terrible format for parsing: they include spaces, commas, and a combination of letters and numbers.

e.g. "July 01, 2021.txt"

Within the file, activity per minute is recorded in long strings of integers, separated by tabs (\t), and organized in blocks of 24 rows by 60 columns (rows corresponding to hours and columns corresponding to minutes). Data for each cage are sequential, so 24 rows x 24 cages adds up to 576 rows. The first minute is column 0 and the last minute is column 59, so column 60 ends up being empty.
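The layout described above can be checked before any reshaping. The following is a minimal sketch, not part of the original notebook: `check_layout` is a hypothetical helper, and the synthetic input assumes that each row ends with a trailing tab, which is one plausible reason the parsed table has an empty 61st column.

```python
import io

import numpy as np
import pandas as pd

N_CAGES = 24           # from the text: 24 birds, one per cage
HOURS_PER_DAY = 24
MINUTES_PER_HOUR = 60


def check_layout(df_wide):
    """Drop all-NaN columns and verify the expected 576 x 60 layout."""
    df_wide = df_wide.dropna(axis=1, how="all")
    expected = (N_CAGES * HOURS_PER_DAY, MINUTES_PER_HOUR)
    if df_wide.shape != expected:
        raise ValueError(f"expected {expected}, got {df_wide.shape}")
    return df_wide


# Synthetic stand-in for one daily file (assumption: trailing tab per row).
rng = np.random.default_rng(0)
rows = [
    "\t".join(map(str, rng.integers(0, 3, MINUTES_PER_HOUR))) + "\t"
    for _ in range(N_CAGES * HOURS_PER_DAY)
]
raw = pd.read_csv(io.StringIO("\n".join(rows)), header=None, sep="\t")

clean = check_layout(raw)
print(raw.shape, "->", clean.shape)
```

Raising an error on an unexpected shape, rather than silently continuing, would make a truncated or partially written daily file visible before it is merged with the others.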

In [199]:
import numpy as np
import pandas as pd
import os 
import glob
from datetime import datetime

os.chdir('/Users/val/Desktop/hop_data')
print(os.getcwd())




/Users/val/Desktop/hop_data


In [200]:
july1=pd.read_csv('july_samples/July 01, 2021.txt', header=None, sep='\t')
print(july1.head())

#count number of rows
print(len(july1.index))

   0   1   2   3   4   5   6   7   8   9   ...  51  52  53  54  55  56  57  \
0   0   0   0   0   0   0   0   0   0   1  ...   0   0   0   0   0   0   0   
1   0   0   2   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   
2   0   0   0   0   0   0   0   1   0   0  ...   0   1   0   0   0   0   0   
3   0   0   0   0   0   1   0   0   1   0  ...   0   0   0   0   0   0   0   
4   0   0   2   0   0   2   0   1   1   0  ...   0   0   0   0   0   0   0   

   58  59  60  
0   0   1 NaN  
1   1   1 NaN  
2   0   0 NaN  
3   0   1 NaN  
4   0   0 NaN  

[5 rows x 61 columns]
576


The challenge will be to streamline this data into a format that is usable for analyses. All of the separate data files should be combined into one long-format dataframe. Date will need to be extracted from the file name and put into a column. Cages will need to be separated based on the order of the rows. Day and night will need to be designated based on the hours that daylights are on and off. 'ALAN' will need to be a binary (0/1) variable that tells whether birds are under ALAN or not, based on time of day and day of experiment. Below is an example of the desired output.

In [201]:
R_example_df=pd.read_csv('ExampleFormat.csv')
print(R_example_df.head())


   Cage  Hour  Minutes  Hops  Phase     Date    treat sex        stage  ALAN
0     1     0        1     0  night  12/1/19  Control   m  acclimation     1
1     1     1        1     0  night  12/1/19  Control   m  acclimation     1
2     1     2        1     0  night  12/1/19  Control   m  acclimation     1
3     1     3        1     0  night  12/1/19  Control   m  acclimation     1
4     1     4        1     0  night  12/1/19  Control   m  acclimation     1
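As a sketch of how the `Phase` column in the example format might be derived, the snippet below applies the light schedule from the data collection section (daylights on at 07:00, off at 17:00). It assumes that hour 0 in the files is midnight, which the source does not state; `phase_from_hour`, `LIGHTS_ON`, and `LIGHTS_OFF` are hypothetical names.

```python
import numpy as np
import pandas as pd

LIGHTS_ON = 7    # daylights on at 07:00 (data collection section)
LIGHTS_OFF = 17  # daylights off at 17:00


def phase_from_hour(hour):
    """Label hours 'day' if LIGHTS_ON <= hour < LIGHTS_OFF, else 'night'.

    Assumes the hour index is clock time (0 = midnight).
    """
    hour = pd.Series(hour)
    is_day = (hour >= LIGHTS_ON) & (hour < LIGHTS_OFF)
    return pd.Series(np.where(is_day, "day", "night"), index=hour.index)


hours = pd.Series(range(24), name="hour")
phase = phase_from_hour(hours)
print(pd.DataFrame({"hour": hours, "phase": phase}).to_string(index=False))
```

Keeping the two cutoffs as named constants means a different photoperiod only requires changing two numbers.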


Cassandra also works on this data, but uses a specific program that was designed to analyze circadian rhythm data in flies. This program takes a different format (tab delimited, different date format), but still requires all of the data to be compiled and reorganized. See example below. 

In [206]:
Fly_example_df = pd.read_csv('fly_data/Monitor40.txt', sep='\t')
print(Fly_example_df.head())

   61355  20 Sep 19  09:02:00  1  0  0.1  0.2  0.3  0.4  0.5  ...  2.5  1.4  \
0  61356  20 Sep 19  09:03:00  1  0    0    0    0    0    0  ...    1    0   
1  61357  20 Sep 19  09:04:00  1  0    0    0    0    0    0  ...    0    1   
2  61358  20 Sep 19  09:05:00  1  0    0    0    0    0    0  ...    1    2   
3  61359  20 Sep 19  09:06:00  1  0    0    0    0    0    0  ...    0    0   
4  61360  20 Sep 19  09:07:00  1  0    0    0    0    0    0  ...    0    0   

   0.17  2.6  2.7  2.8  0.18  0.19  0.20  1.5  
0     0    1    0    4     0     0     1    0  
1     0    1    1    3     0     0     6    1  
2     0    1    0    5     0     0     0    1  
3     0    0    2    0     0     0     3    0  
4     0    1    0    0     0     0     1    0  

[5 rows x 42 columns]


We will use Python to compile, reorganize, and tidy the data into two usable formats. As an additional component, Cassandra and I will work together to practice collaborating on GitHub using the same data.
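The fly-format example above prints dates such as `20 Sep 19` and times such as `09:02:00`. A minimal sketch of producing strings in that style with `strftime` follows. The format codes are inferred only from those printed rows, not from a specification of the fly program, and the remaining columns of that file are not covered here.

```python
from datetime import datetime

# Example timestamp; any datetime would do.
stamp = datetime(2021, 7, 1, 9, 2)

# Format codes inferred from the printed rows ("20 Sep 19", "09:02:00").
# Note: %b (abbreviated month name) depends on the active locale.
fly_date = stamp.strftime("%d %b %y")
fly_time = stamp.strftime("%H:%M:%S")

print(fly_date, fly_time, sep="\t")
```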

## Methods
#### (meta-data outline)


Goals:

For each file... 
1. Open file
2. Create column for hours, assigning hours 0-23 to each block of 24 rows
3. Divide data in file into 24 birds, label as each bird
4. Transform data long, so that "Hour" and "Minute" are each one column
5. Create binomial (0/1) for whether bird was active each minute 
6. Add phase (night/day) depending on hour
7. Add a date to all data (pulling date from original file name, but transforming it into a 'good' date, e.g. Year-Month-Day)
8. Separate date into 3 columns for year, month, day
9. Save file as new file

Then...
10. Loop through files in folder and perform same program on all
11. Combine all new files into one master dataset
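The per-file goals above (steps 2 to 5 and 7) could be wrapped in a single function, sketched here under stated assumptions. `tidy_day` and its `n_cages` parameter are hypothetical names introduced for illustration, and the example runs on a small synthetic table with 2 cages rather than on a real data file.

```python
import numpy as np
import pandas as pd


def tidy_day(df_wide, date, n_cages=24):
    """Reshape one day's wide table (n_cages*24 rows x 60 minutes) to long format."""
    df_wide = df_wide.dropna(axis=1, how="all").copy()
    if len(df_wide) != n_cages * 24:
        raise ValueError("unexpected number of rows")

    # Rows are ordered cage by cage, with 24 hourly rows per cage.
    df_wide.insert(0, "hour", np.tile(np.arange(24), n_cages))
    df_wide.insert(1, "cage", np.repeat(np.arange(1, n_cages + 1), 24))

    df_long = df_wide.melt(id_vars=["cage", "hour"],
                           var_name="minute", value_name="hops")
    df_long["active"] = (df_long["hops"] > 0).astype(int)
    df_long["date"] = pd.Timestamp(date)
    return df_long


# Tiny synthetic example: 2 cages, one hop in every minute.
fake = pd.DataFrame(np.ones((2 * 24, 60), dtype=int))
out = tidy_day(fake, "2021-07-01", n_cages=2)
print(out.shape)
print(out.head())
```

Putting the steps in a function keeps the later loop over files short and lets each step be tested on a small table first.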

In [207]:
#check i'm still in the right directory
pwd

'/Users/val/Desktop/hop_data'

In [208]:
#### Step 1.  

# Data rows are tab-delimited. Upload data and divide by tabs
df=pd.read_csv('new_hop_data/July 01, 2021.txt', header=None, sep='\t')

# (widen internet browser to get all columns!)
print(df.head(24))

#count number of rows
len(df.index)


    0   1   2   3   4   5   6   7   8   9   ...  51  52  53  54  55  56  57  \
0    0   0   0   0   0   0   0   0   0   1  ...   0   0   0   0   0   0   0   
1    0   0   2   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   
2    0   0   0   0   0   0   0   1   0   0  ...   0   1   0   0   0   0   0   
3    0   0   0   0   0   1   0   0   1   0  ...   0   0   0   0   0   0   0   
4    0   0   2   0   0   2   0   1   1   0  ...   0   0   0   0   0   0   0   
5    0   1   1   2   2   2   2   2   2   2  ...   2   3   2   0   0   2   3   
6    1   3   4   4   2   1   2   4   7   4  ...   1   0   0   1   1   1   5   
7    2   2   0   0   0   0   0   0   0   1  ...   0   1   0   0   2   0   1   
8    1   0   0   2   0   0   0   0   1   0  ...   1   3   2   1   2   1   1   
9    0   0   0   1   1   1   1   1   1   1  ...   0   0   0   0   0   0   0   
10   1   1   0   0   0   0   0   0   0   0  ...   0   0   0   0   1   0   0   
11   1   0   0   0   0   0   0   0   0   0  ...   0 

576

In [213]:
# the last column is all NAs! A byproduct of the Matlab program 
# So, we need to remove the last column
df = df.dropna(axis=1, how='all')
df.iloc[: , -5:]

     55  56  57  58  59
0     0   0   0   0   1
1     0   0   0   1   1
2     0   0   0   0   0
3     0   0   0   0   1
4     0   0   0   0   0
..   ..  ..  ..  ..  ..
571   0   0   0   0   0
572   0   0   0   0   0
573   0   0   0   0   0
574   0   0   0   0   0


In [212]:
#### STEP 2. 
#  There are 576 rows, which corresponds to 24 birds x 24 hours. 
#  Assign every 24 rows to an integer 

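Hours_list = list(range(24))
# Guard so the cell can be re-run: inserting a column that is already there
# raises "ValueError: cannot insert hour, already exists"
if 'hour' not in df.columns:
    df.insert(loc=0,
              column='hour',
              value=Hours_list*24)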
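An alternative that avoids the "already exists" error on re-runs is to derive both labels from the row position with integer arithmetic and plain column assignment. This is a sketch on a synthetic table, not the notebook's method; the new columns land at the end of the table rather than at positions 0 and 1, which does not matter for the later `melt` because it selects them by name.

```python
import numpy as np
import pandas as pd

# Synthetic wide table: 576 rows (24 cages x 24 hours), 60 minute columns.
df = pd.DataFrame(np.zeros((576, 60), dtype=int))

row = np.arange(len(df))
# Column assignment overwrites an existing column instead of raising,
# so these lines can be executed repeatedly.
df["hour"] = row % 24        # 0..23 within each cage block
df["cage"] = row // 24 + 1   # 1..24, one block of 24 rows per cage
df["hour"] = row % 24        # second run: no error

print(df[["cage", "hour"]].iloc[[0, 23, 24, 575]])
```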

In [214]:
#### Step 3.  

# Add bird ID to each 24 rows

BirdIDs = list(range(1,25,1))
Birds_list = np.repeat(BirdIDs,24)
assert len(Birds_list) == 576  # check that this created a list of 576 - 24 birds x 24 hours
print(Birds_list)

[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6
  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14

In [215]:
# Add bird ID to each row 
df.insert(loc=1,
          column='cage',
          value=Birds_list)

In [216]:
# Test if it worked, celebrate
print(df.head(100))
print("YAY!")

    hour  cage  0  1  2  3  4  5  6  7  ...  50  51  52  53  54  55  56  57  \
0      0     1  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   
1      1     1  0  0  2  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   
2      2     1  0  0  0  0  0  0  0  1  ...   0   0   1   0   0   0   0   0   
3      3     1  0  0  0  0  0  1  0  0  ...   0   0   0   0   0   0   0   0   
4      4     1  0  0  2  0  0  2  0  1  ...   2   0   0   0   0   0   0   0   
..   ...   ... .. .. .. .. .. .. .. ..  ...  ..  ..  ..  ..  ..  ..  ..  ..   
95    23     4  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   
96     0     5  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   
97     1     5  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   
98     2     5  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   
99     3     5  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   

    58  59  
0    0   1  
1    1   1  
2    0   0  

In [217]:
# Step 4. Wide to long

df_long = pd.melt(df, id_vars=['cage','hour'], var_name='minute', value_name='hops')

In [218]:
# Check that it worked?
print(df_long.head(10))

   cage  hour minute  hops
0     1     0      0     0
1     1     1      0     0
2     1     2      0     0
3     1     3      0     0
4     1     4      0     0
5     1     5      0     0
6     1     6      0     1
7     1     7      0     2
8     1     8      0     1
9     1     9      0     0
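In the printout above, the rows run through all hours of minute 0 before moving to minute 1, because `melt` stacks the table one original column at a time. The `minute` values are also taken from the column labels, which are a mix of strings and integers after the `hour` and `cage` columns are inserted, so the column is likely stored with object dtype. A small sketch of converting `minute` to integers and sorting chronologically within each cage, on a toy table:

```python
import numpy as np
import pandas as pd

# Toy wide table: 1 cage, 3 hours, minutes 0..2 (stand-in for 24 x 24 x 60).
wide = pd.DataFrame(np.arange(9).reshape(3, 3))
wide.insert(0, "hour", [0, 1, 2])
wide.insert(1, "cage", 1)

long = pd.melt(wide, id_vars=["cage", "hour"],
               var_name="minute", value_name="hops")

# Make 'minute' an integer column and order rows by cage, hour, minute.
long["minute"] = long["minute"].astype(int)
long = (long.sort_values(["cage", "hour", "minute"])
            .reset_index(drop=True))

print(long)
```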


In [219]:
# Step 5. Add binomial

df_long['active'] = np.where((df_long.hops > 0), 1, 0)

In [220]:
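# Step 6. Add phase (night/day) 
# NOTE: this labels hours 0-11 as day and 12-23 as night, which differs from the
# 07:00-17:00 daylight schedule described under Data collection; adjust the
# cutoff if hour 0 in the files is midnight.

df_long['phase'] = np.where((df_long.hour > 11), "night", "day")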

In [221]:
# Check that it worked?
print(df_long.head(24))

    cage  hour minute  hops  active  phase
0      1     0      0     0       0    day
1      1     1      0     0       0    day
2      1     2      0     0       0    day
3      1     3      0     0       0    day
4      1     4      0     0       0    day
5      1     5      0     0       0    day
6      1     6      0     1       1    day
7      1     7      0     2       1    day
8      1     8      0     1       1    day
9      1     9      0     0       0    day
10     1    10      0     1       1    day
11     1    11      0     1       1    day
12     1    12      0     0       0  night
13     1    13      0     0       0  night
14     1    14      0     0       0  night
15     1    15      0     0       0  night
16     1    16      0     0       0  night
17     1    17      0     0       0  night
18     1    18      0     0       0  night
19     1    19      0     0       0  night
20     1    20      0     0       0  night
21     1    21      0     0       0  night
22     1   

In [223]:
#### Step 7... and all the rest!

# Looping through files and extracting date
# starting my script with the following should loop through files in a given directory
# It will extract the date (filename) and read the file in order to do all above steps

# (dont actually run this, it's incomplete)

#!/usr/bin/env python3

import os
import glob
import numpy as np
import pandas as pd

path = input("type path for directory, no quotes (hint: /Users/val/Desktop/hop_data/july_samples): ")

all_files = glob.glob(os.path.join(path, "*.txt"))

for file in all_files:
    date = os.path.basename(file)  # this pulls the name of the file (which is the date)
    df_test=pd.read_csv(file, header=None, sep='\t') # this is what will replace my original step 1
    # change df_test to df and do all the above code

    print(date)
    print(df_test.head(5))

In [225]:
# Still, date format is 'July 03, 2021.txt', so we need to fix this
# Once I have extracted the date, the following code will re-format it
# need to import 'datetime'
    
from datetime import datetime

date = "July 03, 2021.txt"

newdate = os.path.splitext(date)[0]  # drop the '.txt' extension (str.strip removes a set of characters, not a suffix)
newdate = newdate.replace(",","")
newdate = newdate.replace(" ", "_")
print(newdate)
#now reads 'July_03_2021'

newdate = datetime.strptime(newdate, '%B_%d_%Y')
#now reads 2021-07-03 !!!
    

July_03_2021
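The same conversion can be done without the intermediate `replace` calls by giving `strptime` a format that matches the file name directly. This is a sketch with an example path; `Path.stem` is used because it removes only the final extension, whereas `str.strip(".txt")` removes any of the characters `.`, `t`, and `x` from both ends of the string.

```python
from datetime import datetime
from pathlib import Path

filename = "new_hop_data/July 03, 2021.txt"   # example path

# Path.stem drops the directory and the final extension only.
stem = Path(filename).stem                     # 'July 03, 2021'

# %B is the full month name (locale dependent), %d the day, %Y the 4-digit year.
parsed = datetime.strptime(stem, "%B %d, %Y")

print(parsed.date().isoformat())

# Why strip() is fragile for suffix removal:
print("test.txt".strip(".txt"))
```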


In [226]:
# now instead of printing the date, we want to use it and put it in a new column
# so will add this to the end of the loop
df_long['date'] = newdate

In [227]:
df_long.head()

   cage  hour minute  hops  active phase       date
0     1     0      0     0       0   day 2021-07-03
1     1     1      0     0       0   day 2021-07-03
2     1     2      0     0       0   day 2021-07-03
3     1     3      0     0       0   day 2021-07-03
4     1     4      0     0       0   day 2021-07-03


In [228]:
# As a final data organization step, I want to separate the date into 3 separate columns 
df_long['year']= df_long['date'].dt.year
df_long['month']= df_long['date'].dt.month
df_long['day']= df_long['date'].dt.day



In [229]:
df_long.head()

   cage  hour minute  hops  active phase       date  year  month  day
0     1     0      0     0       0   day 2021-07-03  2021      7    3
1     1     1      0     0       0   day 2021-07-03  2021      7    3
2     1     2      0     0       0   day 2021-07-03  2021      7    3
3     1     3      0     0       0   day 2021-07-03  2021      7    3
4     1     4      0     0       0   day 2021-07-03  2021      7    3
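Besides splitting the date into parts, it may be useful to combine `date`, `hour`, and `minute` into a single timestamp per row for time-series work. The sketch below uses a three-row toy table and assumes that hour 0 in the files is midnight, which the source does not confirm; the `timestamp` column name is hypothetical.

```python
import pandas as pd

df_long = pd.DataFrame({
    "cage": [1, 1, 1],
    "hour": [0, 13, 23],
    "minute": [0, 30, 59],
    "hops": [0, 2, 1],
})
df_long["date"] = pd.Timestamp("2021-07-03")

# Date at midnight plus hour and minute offsets gives one timestamp per row.
df_long["timestamp"] = (
    df_long["date"]
    + pd.to_timedelta(df_long["hour"], unit="h")
    + pd.to_timedelta(df_long["minute"].astype(int), unit="min")
)

print(df_long[["hour", "minute", "timestamp"]])
```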


In [147]:
pwd

'/Users/val/Desktop/hop_data'

In [173]:
#### Lastly, I will write each of these files into a new file with a new name (newdate)

#initiate directory for outfile, and export dataframe to csv

OUTpath = input("type path for OUTPUT directory, no quotes (hint: /Users/val/Desktop/hop_data/hopdata_clean): ")
#os.mkdir(OUTpath)
#os.chdir(OUTpath)

#os.mkdir('/Users/val/Desktop/hop_data/ActivityFiles_clean')
#os.chdir('/Users/val/Desktop/hop_data/ActivityFiles_clean')
#df_long.to_csv(str(newdate.date())+'.csv')

In [232]:
#### All together now!?

#!/usr/bin/env python3

import os
import glob
import numpy as np
import pandas as pd
from datetime import datetime

# Create the output directory (no chdir, so relative input paths still work)
outdir_path = input("type path for OUTPUT directory, no quotes (hint: /Users/val/Desktop/hop_data/hopdata_cleaned): ")
os.makedirs(outdir_path, exist_ok=True)
print("Directory for output is set up!")

# Set up for big loop
path = input("type path for INPUT directory, no quotes (hint: /Users/val/Desktop/hop_data/july_samples): ")
all_files = glob.glob(os.path.join(path, "*.txt"))

counter = 0
print("files completed: 0")

# Big loop, here we go!

for file in all_files:
    # Read in file
    df = pd.read_csv(file, header=None, sep='\t')

    # Remove last column, which is all NAs
    df = df.dropna(axis=1, how='all')

    # Assign hours 0-23 to each block of 24 rows
    Hours_list = list(range(24))
    df.insert(loc=0,
              column='hour',
              value=Hours_list*24)

    # Add bird ID to each block of 24 rows
    BirdIDs = list(range(1,25,1))
    Birds_list = np.repeat(BirdIDs,24)
    df.insert(loc=1,
              column='cage',
              value=Birds_list)

    # Change wide dataset to long format
    df_long = pd.melt(df, id_vars=['cage','hour'], var_name='minute', value_name='hops')

    # Add binomial for activity
    df_long['active'] = np.where((df_long.hops > 0), 1, 0)

    # Add phase (night/day), using the same hour > 11 cutoff as Step 6
    df_long['phase'] = np.where((df_long.hour > 11), "night", "day")

    # Convert date to nice format, e.g. 'July 03, 2021.txt' -> 2021-07-03
    date = os.path.basename(file)
    newdate = os.path.splitext(date)[0]  # drop '.txt' (str.strip removes characters, not a suffix)
    newdate = newdate.replace(",","")
    newdate = newdate.replace(" ", "_")
    newdate = datetime.strptime(newdate, '%B_%d_%Y')

    # Add date as column
    df_long['date'] = newdate

    # Separate the date into 3 separate columns
    df_long['year']= df_long['date'].dt.year
    df_long['month']= df_long['date'].dt.month
    df_long['day']= df_long['date'].dt.day

    # Write out csv file, named by date, into the output directory.
    # The row index is written too and shows up as 'Unnamed: 0' when the
    # file is read back; pass index=False to omit it.
    newfilename = str(newdate.date())+'.csv'
    df_long.to_csv(os.path.join(outdir_path, newfilename))

    # Track progress
    counter = counter + 1
    print(counter)

# In the end, combine all new csv files into one mega-file

In [249]:
#os.chdir(outdir_path)
os.chdir('/Users/val/Desktop/hop_data/hopdata_cleaned')
# Collect the per-day csv files, skipping the combined file if it is left over from an earlier run
all_new_files = [f for f in glob.glob("*.csv") if f != "combined_cleaned.csv"]
combined_csv = pd.concat([pd.read_csv(f) for f in all_new_files])
combined_csv.to_csv("combined_cleaned.csv", index=False, encoding='utf-8-sig')

In [253]:
test = pd.read_csv('combined_cleaned.csv')
print(test.head())
print(test.tail())

   Unnamed: 0  cage  hour  minute  hops  active phase        date  year  \
0           0     1     0       0     0       0   day  2021-07-09  2021   
1           1     1     1       0     2       1   day  2021-07-09  2021   
2           2     1     2       0     3       1   day  2021-07-09  2021   
3           3     1     3       0     3       1   day  2021-07-09  2021   
4           4     1     4       0     0       0   day  2021-07-09  2021   

   month  day  
0      7    9  
1      7    9  
2      7    9  
3      7    9  
4      7    9  
         Unnamed: 0  cage  hour  minute  hops  active  phase        date  \
1555195       34555    24    19      59     0       0  night  2021-07-07   
1555196       34556    24    20      59     0       0  night  2021-07-07   
1555197       34557    24    21      59     0       0  night  2021-07-07   
1555198       34558    24    22      59     0       0  night  2021-07-07   
1555199       34559    24    23      59     0       0  night  2021-07-07 
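A quick sanity check on the combined file is to confirm that every date contributes the same number of rows: 24 cages x 24 hours x 60 minutes = 34560, which matches the last per-file index of 34559 in the tail above. The sketch below runs on a synthetic table; `rows_per_date` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

ROWS_PER_DAY = 24 * 24 * 60   # cages x hours x minutes = 34560


def rows_per_date(combined):
    """Return the dates whose row count differs from ROWS_PER_DAY."""
    counts = combined.groupby("date").size()
    return counts[counts != ROWS_PER_DAY]


# Synthetic combined table: one complete day and one short day.
full = pd.DataFrame({"date": "2021-07-07",
                     "hops": np.zeros(ROWS_PER_DAY, dtype=int)})
short = pd.DataFrame({"date": "2021-07-09",
                      "hops": np.zeros(100, dtype=int)})
combined = pd.concat([full, short], ignore_index=True)

bad = rows_per_date(combined)
print(bad)
```

An empty result would mean every day is complete; any date listed here would point to a truncated file or a file that was read more than once.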