Initial commit. Bringing over work from deprecated repo.

robdmc · Jun 7, 2014 · 1ca5fb1
1 parent d5fd570
commit 1ca5fb1
Showing 40 changed files with 3,666 additions and 4 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+.DS_Store
+*.pyc
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,26 @@
+Copyright (c) 2014, Robert deCarvalho
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer. 
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+The views and conclusions contained in the software and documentation are those
+of the authors and should not be interpreted as representing official policies, 
+either expressed or implied, of the FreeBSD Project.
diff --git a/README.md b/README.md
@@ -1,7 +1,109 @@
-pandashells
-===========
+PANDASHELLS                           
+===
 
-Bringing the power of python-pandas to the shell prompt
+Description
+-------------------------------------------------------------------------------
+The ptools library was written to bring the power of the python scientific
+stack to the unix command-line. This allows well-known and time-tested tools
+like grep, awk, sed, etc. to interact seamlessly with the powerful data
+manipulation, visualization, and statistical libraries being developed in the 
+python data-science community.
 
 
-Coming soon.
+Installation
+--------------------------------------------------------------------------------
+  --- master branch
+  pip install git+https://github.com/robdmc/ptools.git
+
+  --- experimental branch with pandas (very early stage development)
+  pip install git+https://github.com/robdmc/ptools.git@with_pandas
+
+
+List of tools (run with -h for help, --example to see example)
+--------------------------------------------------------------------------------
+ p.df       Pandas dataframe manipulation of csv files
+
+
+*********** here are some new tools I want
+p.lombscargle
+p.mcmc 'patsy model'  (see if there's an easy way to do this)
+                      Maybe make distribution,params,prior for each variable
+                      p.mcmc 'y ~ x + z' 'x:Normal(mu, sigma)', y:Normal(mu,sigma)
+                      think about defaults here where partials don't have noise
+
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+here are some regression and classification ideas.
+
+p.regress - statsmodels linear regression with full summary output. maybe use --fit to add fit results to df
+p.learn.regress_linear
+p.learn.regress_ridge
+p.learn.regress_tree
+p.learn.regress_forest
+p.learn.classify.logistic
+p.learn.classify.tree
+p.learn.classify.forest
+p.learn.classify.svm
+
+Always use patsy language
+
+the model.pkl files (which can be user-def names) hold the model as well
+as the string used to do the fit
+
+with --fit model.pkl
+saves model in model.pkl and displays rms, R^2, and cross_val scores
+as well as the original string used to do the fit and the type of model
+
+
+with --predict model.pkl
+loads the model and the input, and adds a _fit variable to the dataframe
+with --stats, does same thing, but displays rms and R2
+with --hist shows hist of residuals
+with --plot shows fit vs residual
+
+of course classifiers have their own metrics and maybe have a
+--roc that plots the roc curve
+
+with
+--info model.pkl, just shows the model
+
+with --desc 'my desc'  allows you to store a description that will be
+                       displayed with the --info flag
+
+
+
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+
+
+
+ ********** here is list of tools I want to replicate *********************
+p.cov -> covariance between columns.  cols and index have respective names
+*p.parallel
+*p.plot
+*p.geoCode
+*p.crypt
+p.bar
+p.cdf
+p.color
+p.fft
+p.lombscargle
+p.hist
+p.interp # cat xvals_file | p.interp -r .6 -t <(cat table_file.txt)
+p.linspace
+p.map
+p.mapDots2html
+p.mapPoly2html
+p.mongoDump
+p.normalize
+p.pgsql2csv
+p.pie
+p.rand
+p.regress
+p.scat
+p.server
+p.shuffle
+p.sigEdit
+p.smooth   lowess, spline, medianFilter
+p.sshKeyPush
+p.template
+p.utc2local
diff --git a/ideas.txt b/ideas.txt
@@ -0,0 +1,43 @@
+p.regress - statsmodels linear regression with full summary output
+p.learn.regress_linear
+p.learn.regress_ridge
+p.learn.regress_tree
+p.learn.regress_forest
+p.learn.classify.logistic
+p.learn.classify.tree
+p.learn.classify.forest
+p.learn.classify.svm
+
+Always use patsy language
+
+the model.pkl files (which can be user-def names) hold the model as well
+as the string used to do the fit
+
+with --fit model.pkl
+saves model in model.pkl and displays rms, R^2, and cross_val scores
+as well as the original string used to do the fit and the type of model
+
+
+with --predict model.pkl
+loads the model and the input, and adds a _fit variable to the dataframe
+with --stats, does same thing, but displays rms and R2
+with --hist shows hist of residuals
+with --plot shows fit vs residual
+
+of course classifiers have their own metrics and maybe have a
+--roc that plots the roc curve
+
+with
+--info model.pkl, just shows the model
+
+with --desc 'my desc'  allows you to store a description that will be
+                       displayed with the --info flag
+
+
+
+
+
+
+
+
+
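The notes in ideas.txt describe a model.pkl file that holds the fitted model together with the patsy string used for the fit and an optional description for --info. A minimal sketch of that storage format follows, assuming a plain dict pickled with the standard library. The key names and the `save_model`, `load_model`, and `model_info` helpers are hypothetical, and a dict of coefficients stands in for a real fitted estimator.

```python
import pickle


def save_model(path, model, formula, model_type, desc=''):
    """Pickle the model together with the metadata the notes call for."""
    record = {
        'model': model,            # any picklable fitted estimator
        'formula': formula,        # patsy string used to do the fit
        'model_type': model_type,  # e.g. 'regress_linear' (hypothetical label)
        'desc': desc,              # free text, as proposed for --desc
    }
    with open(path, 'wb') as f:
        pickle.dump(record, f)


def load_model(path):
    """Load the record written by save_model."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def model_info(path):
    """Return the text a hypothetical --info flag might display."""
    rec = load_model(path)
    return 'type: {model_type}\nformula: {formula}\ndesc: {desc}'.format(**rec)
```

Keeping the formula string next to the model means a later --predict run could rebuild the design matrix from new input without the user retyping the fit expression.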
diff --git a/pandashells/__init__.py b/pandashells/__init__.py
diff --git a/pandashells/bin/.p.rand.swp b/pandashells/bin/.p.rand.swp
diff --git a/pandashells/bin/p.config b/pandashells/bin/p.config
@@ -0,0 +1,52 @@
+#! /usr/bin/env python
+
+#--- standard library imports
+import os
+import sys
+import argparse
+
+############# dev only.  Comment out for production ######################
+sys.path.append('../..')
+##########################################################################
+
+
+from ptools.lib import config_lib
+
+
+if __name__ == '__main__':
+
+    #--- read in the current configuration
+    default_dict = config_lib.get_config()
+
+    msg = "Need to write this. "
+    msg += "and write more."
+
+    #--- populate the arg parser with current configuration
+    parser = argparse.ArgumentParser(
+            description=msg)
+    parser.add_argument('--force_defaults', action='store_true',
+             dest='force_defaults',
+            help='Force to default settings')
+    for tup in config_lib.CONFIG_OPTS:
+        msg = 'opts: '+str(tup[1])
+        parser.add_argument('--%s'%tup[0], nargs=1, type=str,
+                dest=tup[0], metavar='',#default_dict[tup[0]],
+                default=[default_dict[tup[0]]], choices=tup[1], help=msg)
+
+    #--- parse arguments
+    args = parser.parse_args()
+
+    #--- set the arguments to the current value of the arg parser
+    config_dict = {t[0]:t[1][0] for t in args.__dict__.iteritems()
+            if not t[0] in ['force_defaults']}
+
+    if args.force_defaults:
+        config_dict = config_lib.DEFAULT_DICT
+    config_lib.set_config(config_dict)
+
+    print '\n Current Config'
+    print '  ' + '-'*40
+    for k in sorted(config_dict.keys()):
+        if k not in ['force_defaults']:
+            print '  {: <20} {}'.format(k+':', config_dict[k])
+
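p.config builds one command-line option per entry in `config_lib.CONFIG_OPTS`, using the stored configuration for defaults. Since config_lib is not part of this listing, the sketch below uses a hypothetical option table in the `(name, allowed_values)` shape the loop appears to expect. The option names, `build_parser`, and `parse_config` are assumptions for illustration, written for Python 3.

```python
import argparse

# Hypothetical option table; the real config_lib.CONFIG_OPTS is not shown.
CONFIG_OPTS = [
    ('io_input_type', ['csv', 'table']),
    ('plot_backend', ['TkAgg', 'Agg']),
]


def build_parser(current):
    """Create one --option per table entry, defaulting to the current value."""
    parser = argparse.ArgumentParser(description='Show or set configuration')
    parser.add_argument('--force_defaults', action='store_true',
                        help='Force to default settings')
    for name, choices in CONFIG_OPTS:
        parser.add_argument('--' + name, choices=choices,
                            default=current[name],
                            help='opts: ' + str(choices))
    return parser


def parse_config(argv, current, defaults):
    """Return the new configuration dict for the given argument list."""
    opts = vars(build_parser(current).parse_args(argv))
    # force_defaults is a switch, not a setting, so remove it from the result
    if opts.pop('force_defaults'):
        return dict(defaults)
    return opts
```

Leaving out `nargs=1` makes argparse store each value as a plain string, so the result needs no `[0]` unwrapping as in the dict comprehension in p.config.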
diff --git a/pandashells/bin/p.crypt b/pandashells/bin/p.crypt
@@ -0,0 +1,53 @@
+#! /usr/bin/env python
+
+#--- standard library imports
+import os
+import sys
+import argparse
+import re
+
+############# dev only.  Comment out for production ######################
+sys.path.append('../..')
+##########################################################################
+
+from ptools.lib import arg_lib
+
+#=============================================================================
+if __name__ == '__main__':
+    msg = "Encrypt a file with aes-256-cbc as implemented by openssl. "
+
+    #--- read command line arguments
+    parser = argparse.ArgumentParser(
+            description=msg)
+
+    arg_lib.addArgs(parser, 'example')
+
+    parser.add_argument('-i', '--inFile', nargs=1, type=str,
+            required=True, dest='inFile', metavar='inFileName',
+            help="The input file name")
+
+    parser.add_argument('-o', '--outFile', nargs=1, type=str,
+            required=True, dest='outFile', metavar='outFileName',
+            help="The output file name")
+
+    parser.add_argument('-d', '--decrypt', action='store_true', default=False,
+           dest='decrypt', help='Decrypt the input file into the output file')
+
+    #--- parse arguments
+    args = parser.parse_args()
+
+    #--- make sure input file exists
+    if not os.path.isfile(args.inFile[0]):
+        sys.stderr.write("\n\nCan't find input file\n\n")
+        sys.exit(1)
+
+    #--- create a decryption command if requested
+    if args.decrypt:
+        cmd = "cat %s | openssl enc -d -aes-256-cbc > %s" % (args.inFile[0],
+                                                                  args.outFile[0])
+    #--- otherwise just encrypt
+    else:
+        cmd = "cat %s | openssl enc -aes-256-cbc -salt > %s" % (args.inFile[0],
+                                                                  args.outFile[0])
+    #--- run the proper openssl command
+    os.system(cmd)
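p.crypt interpolates the file names into a shell string, so a name containing spaces or shell metacharacters would be misread by the shell. One possible alternative, sketched here for Python 3, passes the same openssl options as an argument list through `subprocess`. The `build_openssl_cmd` and `run_openssl` helpers are hypothetical and are not part of this commit; `-in` and `-out` are the standard `openssl enc` file options that replace the `cat` pipe and the output redirection.

```python
import subprocess


def build_openssl_cmd(in_file, out_file, decrypt=False):
    """Return the openssl invocation as an argument list (no shell involved)."""
    cmd = ['openssl', 'enc', '-aes-256-cbc']
    # -d decrypts; -salt is the option p.crypt passes when encrypting
    cmd.append('-d' if decrypt else '-salt')
    cmd.extend(['-in', in_file, '-out', out_file])
    return cmd


def run_openssl(in_file, out_file, decrypt=False):
    """Run openssl; check_call raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_openssl_cmd(in_file, out_file, decrypt))
```

Because each file name is a separate list element, it reaches openssl unchanged regardless of the characters it contains.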
diff --git a/pandashells/bin/p.df b/pandashells/bin/p.df
@@ -0,0 +1,93 @@
+#! /usr/bin/env python
+
+#--- standard library imports
+import os
+import sys
+import argparse
+import re
+
+############# dev only.  Comment out for production ######################
+sys.path.append('../..')
+##########################################################################
+
+from ptools.lib import module_checker_lib, arg_lib, io_lib
+
+#--- import required dependencies
+modulesOkay = module_checker_lib.check_for_modules(
+        [
+            'pandas',
+            'numpy',
+            'scipy',
+            'dateutil',
+            'matplotlib',
+        ])
+if not modulesOkay:
+    sys.exit(1)
+
+import pandas as pd
+import numpy as np
+import scipy as scp
+import pylab as pl
+from dateutil.parser import parse
+import datetime
+
+#=============================================================================
+if __name__ == '__main__':
+    msg = "Bring pandas manipulation to command line.  Input from stdin "
+    msg += "is placed into a dataframe named 'df'.  The output of each "
+    msg += "specified command must evaluate to a dataframe that will "
+    msg += "overwrite 'df'. The output of the final command will be sent "
+    msg += "to stdout.  The namespace in which the commands are executed "
+    msg += "includes pandas as pd, numpy as np, scipy as scp, pylab as pl, "
+    msg += "dateutil.parser.parse as parse, datetime"
+
+    #--- read command line arguments
+    parser = argparse.ArgumentParser(
+            description=msg)
+
+    options = {}
+    arg_lib.addArgs(parser, 'io_in', 'io_out', 'example')
+    parser.add_argument("statement", help="Statement to execute", nargs="+")
+
+    #--- parse arguments
+    args = parser.parse_args()
+
+    #--- get the input dataframe
+    df = io_lib.df_from_input(args)
+
+    #--- define regex to identify if supplied command is for col assignment
+    rex_col_cmd = re.compile(r'.*?df\[.+\].*?=')
+
+    #--- define regex to identify plot commands
+    rex_plot_cmd = re.compile(r'.*(plot|hist)\(.*\).*')
+
+    #--- execute the statements in sequence
+    for cmd in args.statement:
+        #--- if this is a column-assignment command, just execute it
+        if rex_col_cmd.match(cmd):
+            exec(cmd)
+            temp = df
+        #--- if this is a plot command, execute it and quit
+        elif rex_plot_cmd.match(cmd):
+            exec(cmd)
+            pl.show()
+            sys.exit(0)
+
+        #--- if instead this is a command on the whole frame
+        else:
+            #--- put results of command in temp var
+            cmd = 'temp = {}'.format(cmd)
+            exec(cmd)
+
+        #--- transform results to dataframe if needed
+        if isinstance(temp, pd.DataFrame):
+            df = temp
+        else:
+            try:
+                df = pd.DataFrame(temp)
+            except pd.core.common.PandasError:
+                print temp
+                sys.exit(0)
+
+    #--- write dataframe to output
+    io_lib.df_to_output(args, df)
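The loop in p.df sends each statement down one of three branches depending on two regular expressions. The sketch below isolates that dispatch so it can be tried without pandas. The patterns are copied from p.df, while the `classify` helper and its return labels are hypothetical names introduced for illustration.

```python
import re

# Same patterns as in p.df
REX_COL_CMD = re.compile(r'.*?df\[.+\].*?=')
REX_PLOT_CMD = re.compile(r'.*(plot|hist)\(.*\).*')


def classify(cmd):
    """Return which of the three p.df branches a statement would take."""
    if REX_COL_CMD.match(cmd):
        return 'assign'   # executed as-is, df kept
    if REX_PLOT_CMD.match(cmd):
        return 'plot'     # executed, plot shown, program exits
    return 'expr'         # evaluated, result becomes the new df


for statement in ["df['c'] = df.a + df.b",
                  "df.describe()",
                  "df.plot(x='a', y='b')",
                  "df[df['a'] == 1]"]:
    print('{:<25} {}'.format(statement, classify(statement)))
```

One consequence of the assignment pattern is that a filter written as `df[df['a'] == 1]` is also classified as a column assignment, because the pattern only requires an `=` somewhere after a closing bracket. In p.df such a statement would be executed and its result discarded, leaving `df` unchanged; the attribute form `df[df.a == 1]` does not match and is evaluated as an expression.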